Preface 



These proceedings contain the papers presented at the 2004 IFIP International 
Conference on Network and Parallel Computing (NPC 2004), held at Wuhan, 
China, from October 18 to 20, 2004. The goal of the conference was to establish 
an international forum for engineers and scientists to present their ideas and 
experiences in network and parallel computing. 

A total of 338 submissions were received in response to the call for papers. 
These papers were from Australia, Brazil, Canada, China, Finland, France, Germany, 
Hong Kong, India, Iran, Italy, Japan, Korea, Luxembourg, Malaysia, Norway, 
Spain, Sweden, Taiwan, the UK, and the USA. Each submission was sent to at 
least three reviewers. Each paper was judged according to its originality, innovation, 
readability, and relevance to the expected audience. Based on the reviews 
received, a total of 69 papers were accepted to be included in the proceedings. 
Among the 69 papers, 46 were accepted as full papers and were presented at the 
conference. We also accepted 23 papers as short papers; each of these papers was 
given an opportunity to have a brief presentation at the conference, followed by 
discussions in a poster session. Thus, due to the limited scope and time of the 
conference and the high number of submissions received, only 20% of the total 
submissions were included in the final program. 

In addition to the contributed papers, the NPC 2004 program featured several 
keynote speakers: Kai Hwang from the University of Southern California, 
Jose Fortes from the University of Florida, Thomas Sterling from the California 
Institute of Technology, Bob Kuhn from Intel, and Elmootazbellah Elnozahy 
from the IBM Austin Research Center. In addition, there was an invited session 
with several invited talks. The keynote and invited speakers were selected due 
to their significant contributions and reputations in the field. 

Also associated with NPC 2004 were a tutorial session and two workshops, 
the Workshop on Building Intelligent Sensor Networks, and the Workshop on 
Multimedia Modeling and the Security in the Next Generation Network Information 
Systems. 

The IFIP NPC 2004 conference emerged from initial email exchanges between 
Kemal Ebcioglu, Guojie Li, and Guang R. Gao in 2002, with the vision 
toward establishing a new, truly international conference for fostering research 
and collaboration in parallel computing. We are happy to see that the NPC 
conference, with its eminent team of organizers, and its high-quality technical 
program, is well on its way to becoming a flagship conference of IFIP. 

We wish to thank the other members of the organizing committee for their 
contributions. We thank Nelson Amaral for his dedication 
in organizing the tutorial and workshop sessions. We thank the publicity 
co-chairs Cho-Li Wang and Chris Jesshope for their hard work in publicizing 
NPC 2004 under a very tight schedule. 



We are deeply grateful to the program committee members. The large number 
of submissions received and the diversity of the topics covered 
made this review process a particularly challenging one. Also, the program committee 
was working under a very tight schedule. We wish to thank the program 
committee vice-chairs Victor K. Prasanna, Albert Y. Zomaya, and Hai Jin for 
their assistance in organizing the NPC 2004 program and the paper selection 
guidelines. Without the solid work and dedication of the program committee 
members, the success of this program would not have been possible. 

We appreciate the contribution of Hai Jin and his local team at the Huazhong 
University of Science and Technology, Wuhan — in particular, the local chair 
Song Wu and Web chair Li Qi — who organized and handled the conference 
website for paper submissions and the review process, the organization of the 
final program, the design and maintenance of the NPC 2004 conference websites, 
the solicitation of sponsorships and support, and numerous other matters related 
to the local arrangements of the conference. We are deeply impressed by the 
efficiency, professionalism, and dedication of their work. 

We also appreciate other support we received at the Institute of Computing 
Technology (ICT), Beijing, and the University of Delaware. In particular, we 
wish to acknowledge the assistance from Zhenge Qiu and Jiang Yi at ICT, and 
Yingping Zhang and Yanwei Niu at the University of Delaware. 
Secure Grid Computing with Trusted Resources and 
Internet Datamining 



Kai Hwang 

University of Southern California 
Los Angeles, CA. 90089 USA 



Abstract. Internet-based Grid computing is emerging as one of the most 
promising technologies that may change the world. Dr. Hwang and his research 
team at the University of Southern California (USC) are working on self- 
defense tools to protect Grid resources from cyber attacks or malicious 
intrusions, automatically. This project builds an automated intrusion response 
and trust management system to facilitate authentication, authorization, and 
security binding in using metacomputing Grids or peer-to-peer web services. 
The trusted GridSec infrastructure supports Internet traffic datamining, 
encrypted tunneling, optimized resource allocations, network flood control and 
anomaly detection, etc. The USC team is developing a NetShield library to 
protect Grid resources. This new security system adjusts itself dynamically with 
changing threat patterns and network traffic conditions. This project promotes 
the acceptance of Grid computing through international collaborations with the 
research groups in INRIA, France, Chinese Academy of Sciences, and 
Melbourne University. The fortified Grid infrastructure will benefit 
security-sensitive allocations in digital government, electronic commerce, anti-terrorism 
activities, and cyberspace crime control. The broader impacts of this 
ITR project are far reaching in an era of growing demand for Internet, Web and 
Grid services. 
Towards Memory Oriented Scalable Computer 
Architecture and High Efficiency Petaflops Computing 



Thomas Sterling 

Center for Advanced Computing Research 
California Institute of Technology 



Abstract. The separation of processor logic and main memory is an artifact of 
the disparities of the original technologies from which each was fabricated more 
than fifty years ago as captured by the “von Neumann architecture”. 
Appropriately, this separation is designated as “the von Neumann bottleneck”. 
In recent years, the underlying technology constraint for the isolation of main 
memory from processing logic has been eliminated with the implementation of 
semiconductor fabrication foundries that permit the merger of both DRAM bit 
cells and CMOS logic on the same silicon dies. New classes of computer 
architecture are enabled by this opportunity including: 1) system on a chip 
where a conventional processor core with its layers of cache are connected to a 
block of DRAM on the same chip, 2) SMP on a chip where multiple 
conventional processor cores are combined on the same chip through a coherent 
cache structure, usually sharing the L3 cache implemented in DRAM, and 3) 
processor in memory where custom processing logic is positioned directly at 
the memory row buffer in a tightly integrated structure to exploit the short 
access latency and wide row of bits (typically 2K) for high memory bandwidth. 
This last, PIM, can take on remarkable physical structures and logical 
constructs and is the focus of the NASA Gilgamesh project to define and 
prototype a new class of PIM-based computer architecture that will enable a 
new scalable model of execution. The MIND processor architecture is the core 
of the Gilgamesh system that incorporates a distributed shared memory 
management scheme including in-memory virtual to physical address 
translation, a lightweight parcels message-driven mechanism for invoking 
remote transaction processing, multithreaded single cycle instruction issue for 
local resource management, graceful degradation for fault tolerance, and pinned 
threads for real time response. The MIND architecture for Gilgamesh is being 
developed in support of “sea of PIMs” systems for both ground based Petaflops 
scale computers and scalable space borne computing for long term autonomous 
missions. One of its specific applications is in the domain of symbolic 
computing for knowledge management, learning, reasoning, and planning in a 
goal directed programming environment. This presentation will describe the 
MIND architecture being developed through the Gilgamesh project and its 
relation to the Cray Cascade Petaflops computer being developed for 2010 
deployment under DARPA sponsorship. 



H. Jin et al. (Eds.): NPC 2004, LNCS 3222, pp. 2-2, 2004. 

© IFIP International Federation for Information Processing 2004 




In-VIGO: Making the Grid Virtually Yours 



Jose Fortes 

Department of Electrical and Computer Engineering, University of Florida 



Abstract. Internet-based Grid computing is emerging as one of the most 
promising technologies that may change the world. Dr. Hwang and his 
research team at the University of Southern California (USC) are working on 
self-defense tools to protect Grid resources from cyber attacks or malicious 
intrusions, automatically. This project builds an automated intrusion response 
and trust management system to facilitate authentication, authorization, and 
security binding in using metacomputing Grids or peer-to-peer web services. 
The trusted GridSec infrastructure supports Internet traffic datamining, 
encrypted tunneling, optimized resource allocations, network flood control and 
anomaly detection, etc. The USC team is developing a NetShield library to 
protect Grid resources. This new security system adjusts itself dynamically with 
changing threat patterns and network traffic conditions. This project promotes 
the acceptance of Grid computing through international collaborations with the 
research groups in INRIA, France, Chinese Academy of Sciences, and 
Melbourne University. The fortified Grid infrastructure will benefit security- 
sensitive allocations in digital government, electronic commerce, anti-terrorism 
activities, and cyberspace crime control. The broader impacts of this ITR 
project are far reaching in an era of growing demand of Internet, Web and Grid 
services. 
Productivity in HPC Clusters 



Bob Kuhn 

Intel Corp. 

bob.kuhn@intel.com 



Abstract. This presentation discusses HPC productivity in terms of: (1) 
effective architectures, (2) parallel programming models, and (3) applications 
development tools. The demands placed on HPC by owners and users of 
systems ranging from public research laboratories to private scientific and 
engineering companies enrich the topic with many competing technologies and 
approaches. Rather than expecting to eliminate each other in the short run, these 
HPC competitors should be learning from one another in order to stay in the 
race. Here we examine how these competing forces form the engine of 
improvement for overall HPC cost/effectiveness. First, what will the effective 
architectures be? Moore's law is likely to still hold at the processor level over 
the next few years. Those words are, of course, typical from a semiconductor 
manufacturer. More important for this conference, our roadmap projects that it 
will accelerate over the next couple of years due to Chip Multi Processors, 
CMPs. It has also been observed that cluster size has been growing at the same 
rate. Few people really know how successful the Grid and Utility Computing 
will be, but virtual organizations may add another level of parallelism to the 
problem solving process. Second, on parallel programming models, hybrid 
parallelism, i.e. parallelism at multiple levels with multiple programming 
models, will be used in many applications. Hybrid parallelism may emerge 
because application speedup at each level can be multiplied by future 
architectures. But, these applications can also adapt best to the wide variety of 
data and problems. Robustness of this type is needed to avoid high software 
costs of converting or incrementally tuning existing programs. This leads to 
OpenMP, MPI, and Grid programming model investments. Third, application 
tools are needed for programmer productivity. Frankly, integrated programming 
environments have not made much headway in HPC. Tools for debugging and 
performance analysis still define the basic needs. The term debugging is used 
advisedly because there are limits to the scalability of debuggers in the amount of 
code and number of processors even today. How can we break through? 
Maybe through automated tools for finding bugs at the threading and process 
level? Performance analysis capability similarly will be exceeded by the growth 
of hardware parallelism, unless progress is made. 
Whole-Stack Analysis and Optimization of 
Commercial Workloads on Server Systems 
Manish Gupta, Tatsushi Inagaki, Kazuaki Ishizaki, Joefon Jann, 
Robert D. Johnson, Toshio Nakatani, Il Park, Pratap Pattnaik, 
IBM T.J. Watson Research Center 
Abstract. The evolution of the Web as an enabling tool for e-business 
introduces a challenge to understanding the execution behavior of large-scale 
middleware systems, such as J2EE [2], and their commercial workloads. 
This paper presents a brief description of the whole-stack analysis 
and optimization system - being developed at IBM Research - for commercial 
workloads on WebSphere Application Server (WAS) [5] - IBM’s 
implementation of J2EE - running on IBM’s pSeries [4] and zSeries [3] 
server systems. 



1 Introduction 

Understanding the execution behavior of a software or hardware system is cru- 
cial for improving its execution performance (i.e., optimization or performance 
tuning). The evolution of the Web as an enabling tool for e-business introduces 
a challenge to understanding the execution behavior of large-scale middleware 
systems, such as J2EE [2], and their applications. 

J2EE is a collection of Java interfaces and classes for business applications. 
J2EE implementations provide a container that hosts Enterprise JavaBeans 
(EJBs), from which J2EE applications are constructed. The J2EE container 
is like an operating system for EJBs, providing services such as database access 
and messaging, as well as managing resources like threads and memory. 

J2EE is a Java application running on a Java Virtual Machine (JVM). The 
JVM in turn is like an operating system for J2EE and its applications, providing 
services such as synchronization and memory management. The JVM, typically 
written in C or C++, is an application of the underlying operating system, 

* Contact author 

** Also at Univ. of Illinois at Urbana-Champaign 
Fig. 1. Performance Analysis and Optimization Methodology 



which itself is a client of the underlying hardware. The complicated interactions 
between the various layers of the software stack, from the J2EE applications 
to J2EE to JVM and to the operating system, and the underlying hardware 
layer, are a major source of the challenge to understanding and optimizing the 
performance of large-scale middleware systems and their applications. 

This paper presents a brief description of the whole-stack analysis and optimization 
system - being developed at IBM Research - for commercial workloads 
on WebSphere Application Server (WAS) running on IBM’s pSeries [4] and 
zSeries [3] server systems. WAS is IBM’s implementation of the J2EE middleware. 

2 Whole-Stack Analysis and Optimization System 

Figure 1 shows the performance analysis and optimization methodology of the 
system. The methodology provides means for: 

1. instrumenting the software layers to collect statistics about software events 
at source level, 

2. instrumenting hardware to collect statistics about various hardware events, 
and more importantly 

3. correlating these two so that one can see the hardware events that correspond 
to a software event, and vice versa. 

This enables the detection of hot-spots at either level and initiates the generation of 
corresponding events at the other level. 
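The correlation step in item 3 can be pictured with a minimal sketch. It assumes that software events are recorded as non-overlapping time intervals and that hardware counter samples carry timestamps. The function name `attribute_samples`, the event names, and all numbers are hypothetical and are not taken from the system described here.

```python
from bisect import bisect_right

def attribute_samples(intervals, samples):
    """Sum hardware counter samples into the software event active at sample time.

    intervals: list of (start, end, name), sorted by start and non-overlapping.
    samples:   list of (timestamp, count) from a hardware counter.
    """
    starts = [start for start, _, _ in intervals]
    totals = {}
    for ts, count in samples:
        k = bisect_right(starts, ts) - 1      # last interval starting at or before ts
        if k >= 0:
            _, end, name = intervals[k]
            if ts < end:                      # sample falls inside that interval
                totals[name] = totals.get(name, 0) + count
    return totals

# Hypothetical trace: two kinds of software events and some DTLB-miss samples.
events = [(0, 10, "gc"), (10, 25, "Servlet.service"), (30, 40, "gc")]
misses = [(2, 5), (12, 7), (20, 1), (27, 9), (35, 4)]
hot = attribute_samples(events, misses)
```

The sample at time 27 falls between events and is left unattributed. A real tool would also have to deal with overlapping events and multiple threads, which this sketch ignores.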

The system employs static and dynamic instrumentation of the whole software 
stack, and generates traces of various software and hardware runtime events 
Fig. 2. Correlation between Data TLB (DTLB) Misses and CPI 



both for online reduction and offline evaluation. Two major analyses are then 
applied to the trace: the spatial code analysis and the temporal event correlation. 

The spatial code analysis identifies the static code body whose execution 
behavior is of interest; it identifies methods or basic blocks in a method that 
incur high execution overhead. A major tool of this analysis is the static and 
dynamic call graph and the call tree. The temporal event correlation correlates 
runtime events and performance metrics of interest from the software and the hardware 
layers; it identifies the relationship among various runtime 
events, such as Java garbage collection or thread synchronization, and performance 
metrics such as cycles per instruction (CPI). Figure 2 shows an example 
performance metric trace that exhibits the correlation between CPI and the 
data TLB misses. The trace is generated by IBM’s hardware performance monitor 
(hpm) tool kits [1]. Figure 3 shows the system layers, and examples of the 
performance metrics available for each layer. 
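One simple way to quantify a relationship such as the one between DTLB misses and CPI is a correlation coefficient over per-interval samples. The sketch below uses the Pearson coefficient as an assumed stand-in; the text does not state which statistic the system uses, and the sample values are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sample series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-interval samples (illustrative numbers, not measured data).
dtlb_misses = [120, 340, 90, 410, 150, 380]
cpi = [1.1, 2.0, 1.0, 2.3, 1.2, 2.1]
r = pearson(dtlb_misses, cpi)
```

A value of `r` close to 1 would suggest that intervals with many DTLB misses also tend to have high CPI; it does not by itself establish that the misses cause the slowdown.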

The results from the spatial code analysis and the temporal event correlation 
are combined to identify static code bodies responsible for relatively poor perfor- 
mance, and potential remedies for unsatisfactory performance. Based on these 
findings, the system is redesigned and modified, and the cycle of performance 
analysis and optimization repeats until the performance becomes satisfactory. 

3 Conclusion 

We have presented a brief description of the whole-stack analysis and optimiza- 
tion system, being developed at IBM Research, for commercial workloads on 
System Layers        Metrics Examples

Java Application     Transaction response time,
                     Application specific events

WebSphere (J2EE)     # of executing beans, # of activated beans,
                     # JDBC connections, Message queue lengths

Java VM              Garbage collections, Heap free space, Object allocations,
                     JIT compilations, Java monitor contention

OS                   Context switches, Page faults, CPU utilization,
                     SVC calls, Disk IO, Network utilization

HW                   Hardware Performance Counters:
                     # of instructions, # loads / stores, # cache misses, # of TLB misses

Fig. 3. System Layers and Performance Metrics 



WebSphere Application Server (WAS) running on IBM’s pSeries [4] and zSeries [3] 
server systems. Based on the analysis results we have obtained so far, we are 
currently experimenting with several optimizations at the various software stack 
layers, from WAS to the operating system, and also at the hardware. We will 
publish the results of these optimizations in the future. 



References 

[1] HPM Tool Kit. http://www.alphaworks.ibm.com/tech/hpmtoolkit. 

[2] Java 2 Platform, Enterprise Edition (J2EE). http://java.sun.com/j2ee. 

[3] Mainframe servers: zSeries. http://www-1.ibm.com/servers/eserver/zseries. 

[4] IBM pSeries Information Center. http://publib16.boulder.ibm.com/pseries/en_US/infocenter/base. 

[5] WebSphere Application Server. http://www.ibm.com/websphere. 





Fuzzy Trust Integration for Security 
Enforcement in Grid Computing* 



Shanshan Song, Kai Hwang, and Mikin Macwan 



Internet and Grid Computing Laboratory 
University of Southern California, Los Angeles, CA. 90089 USA 
{shanshas, kaihwang}@usc.edu 



Abstract. Building mutual trust among Grid resource sites is crucial 
to securing distributed Grid applications. We suggest enhancing the trust index of 
resource sites by upgrading their intrusion defense capabilities and checking the 
success rate of jobs running on the platforms. We propose a new fuzzy-logic 
trust model for securing Grid resources. Grid security is enforced through trust 
update, propagation, and integration across sites. Fuzzy trust integration reduces 
platform vulnerability and guides the defense deployment across Grid sites. We 
developed a SeGO scheduler for trusted Grid resource allocation. 

The SeGO scheduler optimizes the aggregate computing power with 
security assurance under fixed budget constraints. The effectiveness of the 
scheme was verified by simulation experiments. Our results show up to 90% 
enhancement in site security. Compared with no trust integration, our scheme 
leads to 114% improvement in Grid performance/cost ratio. The job drop rate 
is reduced by 75%. The utilization of Grid resources increased to 92.6% as more 
jobs are submitted. These results demonstrate significant performance gains 
through optimized resource allocation and aggressive security reinforcement. 



1. Introduction 

In Grid computing systems [2], user programs containing malicious code may 
endanger the Grid resources used. Shared Grid resources once infected may damage 
the user applications running on the Grid platforms [8]. We address these issues by 
allocating Grid resources with security assurance. The assurance is achieved by 
hardware, software, and system upgrades to avoid application disasters in an open 
Grid environment. 

Mutual trust must be established between all participating resource sites. Like 
human relationships, trust is often expressed in linguistic terms rather than numerically. 
Fuzzy logic is very suitable for quantifying trust among peer groups. Fuzzy theory 
[10] has not been explored much in network security control. To the best of our knowledge, 
only Manchala has suggested a fuzzy trust model for securing E-commerce [12]. 



* The work was presented in the IFIP International Symposium on Network and Parallel 
Computing (NPC-2004), Wuhan, China, October 18-22, 2004. This research was supported by 
NSF/ITR Grant ACI-0325409 to the University of Southern California. 
Azzedin and Maheswaran [3] and Liu and Shen [11] have developed some 
security-aware models between resource providers and consumers. Globus GSI uses 
public key certificates and proxies for trust propagation [7]. We use fuzzy inferences 
to consolidate security enforcement measures in trusted Grid computing. 

Our trust assessment involves the measurement of dependability, security, 
reliability and performability. The trust level is updated after the Grid has successfully 
executed user jobs. Butt, Fortes, et al. [3] protect Grid resources from distrusted 
applications. We choose a reverse approach by assuring security in the resource pool. 
Figure 1 shows the interaction between two Grid Resource sites. 




Fig. 1. Securing Grid resources with trust integration and resource brokerage 



We propose a Secure Grid Outsourcing (SeGO) scheduler to outsource jobs to 
multiple resources. We aim at maximizing the computing power under security 
scrutiny and cost minimization. In this paper, we report simulation results on the 
SeGO performance by executing 300 jobs on six Grid resource sites. Other studies on 
Grid resource allocation can be found in [2], [4], [6], [14]. 

The remaining sections are organized as follows. In Section 2, we present our 
distributed security architecture at the USC GridSec project. Section 3 introduces the 
fuzzy logic for trust management. Section 4 describes the process of fuzzy trust 
integration. Section 5 introduces the optimized resource allocation scheme. All 
experimental results are reported in Section 6. Finally, we summarize the research 
findings and make suggestions for further work. 



2. GridSec Project for Trusted Grid Computing 

As shown in Fig. 1, the GridSec (Grid Security) project at USC builds security 
infrastructure and self-defense toolkits over multiple Grid resource sites. The security 
functionalities are monitored and coordinated by a security manager in each site. All 
security managers work together to enforce the central security policy. The security 
managers oversee all resources under their jurisdiction [9]. The GridSec architecture 
supports scalability, high-security, and system availability in trusted Grid computing. 
Our purpose is to design for scalability and security assurance at the same time. 

Virtual Private Networks (VPNs) are built for Grid trust management among 
private networks through a public network. We establish only a minimum number of 
encrypted channels in the VPN. The VPN has a number of advantages over the use of 
PKI in Grid computing. Using encrypted channels in a Grid reduces or eliminates the 
overheads in frequent authentication, trust propagation, key management, and 
authorization in most Grid operations [14]. VPN achieves single sign-on easily 
without using public-key certificates. VPN also reduces the packet exchanges among 
Grid sites. No certificate authority is needed in VPN-based Grid architecture, once the 
tunnels are established. The design is aimed at optimizing Grid resources under the 
constraints of limited computing power, security assurance, and Grid service budget. 




Fig. 2. USC GridSec Project: A distributed Grid security architecture, built with encrypted 
tunnels, micro firewalls, and hybrid intrusion detection systems, coordinated by cooperative 
security managers at scattered resource sites 

A self-defense software library, called NetShield, is under development in the 
GridSec project. This package supports fine-grain access control with automatic 
intrusion prevention, detection, and responses [9, 13]. This system is based on 
dynamic security policies, adaptive cryptographic engines, privacy protection, and 
VPN tunneling. Dynamic security demands the adaptability in making policy changes 
at run time. Three steps are shown in Fig. 2 for intrusion detection, alert broadcast, and 
coordinated intrusion responses. 



3. Fuzzy Logic for Trust Management 

The trust relationships among Grid sites are hard to assess due to uncertainties 
involved. Two advantages of using fuzzy-logic to quantify trust in Grid applications 
are: (1) Fuzzy inference is capable of quantifying imprecise data or uncertainty in 
measuring the security index of resource sites. (2) Different membership functions 
and inference rules could be developed for different Grid applications, without 
changing the fuzzy inference engine. 

We close up the security gap by mapping only secure resources to Grid 
applications. Security holes may appear as OS blind spots, software bugs, privacy 
traps, and hardware weakness in resource sites. These holes may weaken the trust 
index value. In our scheme, the trust index Γ (or t_ij) is determined by the job success rate 
Φ and the self-defense capability Δ of each resource pair. 

In Fig. 3, we plot the variation of the trust index of a resource site, as the job 
success rate and site self-defense capability are enhanced from low to high values. 
These two attributes enhance each other on many computer platforms. The trust index 
increases with the increase of both contributing factors. The trust index could 
decrease after network attack incidents. This plot guides the trust integration process. 
We allocate resources with a high degree of security assurance. In subsequent sections, 
we show a systematic method to produce the trusted conditions on Grid sites. 




Fig. 3. Variation of trust index Γ with respect to the variations of job success rate Φ and 
intrusion defense capability Δ at each resource site 



Essentially, previous job execution experiences determine the trustworthiness of 
the peer machines. In the initialization of the trust index of a new resource site, the 
reported job success rate and intrusion defense capability are used to generate an 
initial trust value. The trust index is then updated periodically with the site operations, 
until the site is removed from the Grid domain. 

We treat these security attributes as fuzzy variables, characterized by the 
membership functions in Fig. 4. In fuzzy logic, the membership function μ(x) for a 
fuzzy element x specifies its degree of membership in a fuzzy set. It maps element x 
into the interval [0, 1], with 1 for full membership and 0 for no membership. Fuzzy 
logic can handle imprecise data or uncertainty in the trust index of a resource site. 

Figure 4(a) shows the “high” membership function for trust index Γ. A resource site 
with 0.75 trust index is considered high trust. Figure 4(b) shows five membership 
functions corresponding to very low, low, medium, high, and very high degree of 
trustworthiness. Figure 4(c) shows the cases of three ascending degrees of the self- 
defense capability. Figure 4(d) shows five levels of job success rate. The inference 
rules are subject to designer’s choice. 

Fuzzy inference is a process to assess the trust index in five steps: (1) Register the 
initial values of the success rate Φ and defense capability Δ. (2) Use the membership 
functions to generate membership degrees for Φ and Δ. (3) Apply the fuzzy rule set 
to map the input space (Φ-Δ space) onto the output space (Γ space) through fuzzy 
‘AND’ and ‘IMPLY’ operations. (4) Aggregate the outputs from each rule, and (5) 
Derive the trust index through a defuzzification process. The details of these five steps 
can be found in Fuzzy Engineering by Kosko [10]. 





(b) 5 levels of trust index, Γ 

Fig. 4. Membership functions for different levels of the trust index Γ, job success rate Φ, and 
site defense capability Δ 



Figure 5 shows the trust inference process using the membership functions in 
Fig. 4. We consider initial values: Φ = 0.84 and Δ = 0.26, obtained from previous Grid 
application experiences. Two example fuzzy inference rules are given below for use 
in the inference process shown in Fig. 5. 



Rule 1: If Φ is very high and Δ is medium, then Γ is high. 

Rule 2: If Φ is high and Δ is low, then Γ is medium. 




[Figure: Rule 1: Φ is very high AND Δ is medium IMPLY Γ is high. Rule 2: Φ is high AND Δ is low IMPLY Γ is medium. Inputs Φ = 0.84 and Δ = 0.26; AGGREGATE; output Γ = 0.6.] 

Fig. 5. Fuzzy logic inference between job success rate Φ and self-defense capability Δ to 
induce the trust index Γ of a resource site 



All selected rules are inferred in parallel. Initially, the membership is determined 
by assessing all terms in the premise. The fuzzy operator ‘AND’ is applied to 
determine the support degree of the rules. The AND results are aggregated together. 
The final trust index Γ = 0.6 is generated by defuzzifying the aggregation. The 
“AGGREGATE” superimposes two curves to produce the membership function for Γ. 
There are many other fuzzy inference rules that can be designed using various 
conditional combinations of the fuzzy variables, Φ and Δ. 
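The five inference steps can be sketched in a few lines. This is a minimal illustration only: the triangular membership shapes and their breakpoints, the use of min for ‘AND’ and ‘IMPLY’, max for aggregation, and centroid defuzzification are all assumptions made for the example, and the names `tri` and `infer_trust` are hypothetical. Because the membership functions differ from those in Fig. 4, the sketch is not expected to reproduce the trust index value 0.6 reported in the text.

```python
def tri(x, a, b, c):
    """Triangular membership: 0 at a and c, 1 at b (assumes a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Assumed fuzzy sets; the breakpoints are illustrative, not those of Fig. 4.
PHI = {"high": (0.5, 0.75, 1.0), "very high": (0.75, 1.0, 1.25)}
DELTA = {"low": (0.0, 0.25, 0.5), "medium": (0.25, 0.5, 0.75)}
GAMMA = {"medium": (0.25, 0.5, 0.75), "high": (0.5, 0.75, 1.0)}

# The two example rules: (success-rate term, defense term, trust term).
RULES = [("very high", "medium", "high"), ("high", "low", "medium")]

def infer_trust(phi, delta, steps=1000):
    """Map success rate phi and defense capability delta to a trust index."""
    # Steps 1-3: fuzzify the inputs; AND (min) gives each rule's firing strength.
    strengths = [
        (min(tri(phi, *PHI[p]), tri(delta, *DELTA[d])), g) for p, d, g in RULES
    ]
    # Steps 3-5: clip each output set (IMPLY as min), aggregate with max,
    # then take the centroid of the aggregated curve over [0, 1].
    num = den = 0.0
    for k in range(steps + 1):
        y = k / steps
        mu = max(min(w, tri(y, *GAMMA[g])) for w, g in strengths)
        num += y * mu
        den += mu
    if den == 0.0:
        raise ValueError("no rule fired for these inputs")
    return num / den

trust = infer_trust(0.84, 0.26)
```

With only two rules, many input pairs fire no rule at all, which is why the sketch raises an error in that case; a fuller rule set covering all term combinations would avoid it.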



4. Trust Integration Across Grid Resource Sites 



We use trust integration across multiple Grid domains to model transitive trust
relationships. Each site S_j maintains a trust vector V_j = (t_1j, t_2j, ..., t_mj). The trust
index t_ij for 1 ≤ i, j ≤ m represents the trust of site S_i by site S_j. t_ij is a fraction, with 0
representing the most risky case without any protection and 1 a fully trusted
condition with the highest security assurance. Any value in between indicates a
partially secured site. We define a trust matrix for m resource sites as an m × m square
matrix M = (t_ij).

Trust update and trust propagation processes are specified in Algorithm 1 and
Algorithm 2, respectively. We aim at reducing the site vulnerability by upgrading
its self-defense capability Δ. Suppose that the SeGO agent of site S_i has monitored all
jobs executed on site S_j for some time to know its success rate and defense capability.
Let s_ij be the new security stimulus between sites S_i and S_j at a certain time instant.
Equation (1) calculates the new trust index from the old value and the present stimulus.



$$t_{ij}^{new} = \alpha\, t_{ij}^{old} + (1 - \alpha)\, s_{ij} \qquad (1)$$



The weighting factor α is a random variable in the range (0, 1). For security-critical
applications, the trust index should change promptly to reflect the new situation, thus
a small α is adopted, such as α < 0.3. But for low-security applications, a large α is
adopted with α > 0.9. In general situations, one can set α in the range (0.7, 0.8).
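
As a small worked example of Eq. (1), the Python sketch below applies the update with a small and a large weighting factor. The function name and the numbers (old trust 0.5, stimulus 0.9) are illustrative assumptions, not values from the simulation.

```python
def update_trust(t_old, stimulus, alpha):
    """Eq. (1): blend the old trust index with the new stimulus."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    return alpha * t_old + (1.0 - alpha) * stimulus

# Illustrative numbers only: old trust 0.5, new stimulus 0.9.
fast = update_trust(0.5, 0.9, alpha=0.2)    # security-critical setting, about 0.82
slow = update_trust(0.5, 0.9, alpha=0.95)   # low-security setting, about 0.52

if __name__ == "__main__":
    print(fast, slow)
```

A small α lets the new stimulus dominate, so the index moves most of the way toward 0.9 in a single update; a large α keeps the index near its old value.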

Algorithm 1: Trust_Update(index_TTL reports, i, j)

(1) R_i calculates the success rate of R_j: Φ = number of successful jobs / index_TTL;

(2) R_i assesses the defense rate Δ of R_j;

(3) Calculate the stimulus value: s_ij = Fuzzy_inference(Φ, Δ);

(4) Calculate the new trust index: t_ij^new = α t_ij^old + (1 - α) s_ij;

(5) if ((t_ij^new < t_ij^old) or (t_ij^new < average trust requirement))

    Enhance defense capability of R_j: Δ(R_j) = Δ(R_j) + δ(Δ).

Algorithm 2: Trust_Propagation(i)

(1) R_i broadcasts V_i;

(2) for j = 1 to i-1, i+1 to m

(3) V_j^new = ((m - 1)/m) V_j^old + (1/m) V_i.



We introduce two simulation terms, the trust index_TTL and the trust vector_TTL, to
measure user applications submitted to each site. When site S_j accumulates index_TTL
job reports from site S_i, it updates the trust index t_ij using Eq. (1). With fuzzy trust
quantification, the stimulus value s_ij is determined first. Then the new trust index t_ij is
updated accordingly. If the trust index decreases or is lower than the average, the
defense capability Δ of S_j is forced to increase by an amount δ characterized in Eq. (2).

$$\Delta(S_j) = \Delta(S_j) + \delta(\Delta) \qquad (2)$$

In Eq. (2), the increment δ(Δ) is a function of the current Δ value. If the current Δ is
high, δ should be a small increment. If Δ is low, δ should be larger. A site
with low Δ catches up faster this way. The ultimate purpose of trust
integration is to enhance the trust index or security level at weak resource sites. Trust
integration leads to normalization and equalization of trust indices at all sites. The
trust vectors are broadcast periodically. When a site S_j has accumulated vector_TTL
job execution reports, it broadcasts its trust vector to other sites. With m sites, the
contribution from each site is roughly 1/m. Algorithm 2 is used to calculate the new
trust vector for site S_i by each resource site S_j for j = 1, 2, ..., m.
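
The propagation and defense-upgrade steps can be sketched in Python as follows. Two assumptions are made here: step (3) of Algorithm 2 is read as a weighted average in which the received vector contributes 1/m, consistent with the 1/m contribution stated above, and the increment δ(Δ) is modelled as gain × (1 - Δ), a hypothetical form chosen only because it is large for low Δ and small for high Δ. The paper does not specify δ, and the function names are illustrative.

```python
def propagate(own_vector, received_vector, m):
    """Algorithm 2, step (3): the received vector contributes 1/m."""
    w = 1.0 / m
    return [(1.0 - w) * old + w * new
            for old, new in zip(own_vector, received_vector)]

def defense_increment(delta, gain=0.5):
    """Hypothetical delta(Delta): large when Delta is low, small when high."""
    return gain * (1.0 - delta)

def enhance_defense(delta, gain=0.5):
    """Eq. (2): Delta <- Delta + delta(Delta); stays in [0, 1] for gain <= 1."""
    return delta + defense_increment(delta, gain)

if __name__ == "__main__":
    print(propagate([0.5, 0.5], [0.9, 0.1], m=6))
    print(enhance_defense(0.2), enhance_defense(0.8))
```

With six sites, one broadcast moves each entry only one sixth of the way toward the received value, which is why the trust indices equalize gradually over several integration steps.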



5 Optimization of Trusted Resource Allocation

Based on the fuzzy trust model, we present below in Algorithm 3 the SeGO scheduler 
for optimized Grid resource allocation. The SeGO scheduler was developed under the 
following assumptions: (1) non-preemptive job scheduling, (2) divisible workload 
across Grid sites, and (3) space sharing in a batch mode over multiple jobs. Our SeGO 
scheduler is specified with a nonlinear programming model [5]. 



Algorithm 3: SeGO(R_j, Job = (W, D, T, B))

Input: Submit Job = (W, D, T, B) to resource site R_j at time τ; R_j requests
resources from all m sites.

Output: Workload distribution (W_1, W_2, ..., W_m) and estimated execution time
L for Job based on the allocation X = (x_1, x_2, ..., x_m) generated.

(1) R_j sends requests to obtain available resource information from all sites;

(2) for i = 1 to m

(3)     if (t_ij < T) x_i = 0;

(4) end for

(5) Estimate execution time L = D - τ;

(6) Generate the allocation vector X = (x_1, x_2, ..., x_m), which maximizes

$$E = \sum_{i=1}^{m} x_i P_i L\, t_{ij} \Big/ \sum_{i=1}^{m} x_i P_i L\, C_i$$

    subject to the following constraints:

$$\sum_{i=1}^{m} x_i P_i L \ge W, \qquad \sum_{i=1}^{m} x_i P_i L\, C_i \le B, \qquad 0 \le x_i \le 1;$$

(7) for i = 1 to m: W_i = x_i P_i L;

(8) return (W_1, W_2, ..., W_m) with allocation X = (x_1, x_2, ..., x_m).



A job is submitted with the descriptor Job = (W, D, T, B), representing the
workload, execution deadline, minimum trust, and budget limit. A job is required to
complete execution before the posted deadline. Denote the current time instant by τ.
This is the start time of a job execution. The estimated job execution time is denoted
by L = D - τ. Let x_i be the percentage of the peak power P_i allocated to the job J.
The product x_i P_i represents the actual power allocated. We define W_i = x_i P_i L as the
workload to be executed at site R_i for the job J.

The input to Algorithm 3 is the successive job descriptions, including the site and
the time at which each job is submitted. After the qualification test in steps (2)-(4), the
unqualified sites are filtered out. The estimated execution time L is registered first.
Then, the resource allocation vector X = (x_1, x_2, ..., x_m) is generated by optimizing the
objective function, the trusted performance/cost ratio E, defined below.

$$E = \sum_{i=1}^{m} x_i P_i L\, t_{ij} \Big/ \sum_{i=1}^{m} x_i P_i L\, C_i \qquad (3)$$

The numerator is the aggregate computing power, weighted by the trust index t_ij
and allocated from m Grid sites. The denominator is the total Grid service charge for
executing the job. The terms P_i and C_i are the computing power and service charge at
site R_i. The SeGO solution is obtained with a nonlinear programming solver, subject
to the constraints listed in step (6).
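
To make the objective and constraints concrete, the Python sketch below evaluates E and checks feasibility for one candidate allocation. It does not solve the nonlinear program; a solver would search over x for the feasible allocation with the largest E. The function name `sego_evaluate` and the unit-free numbers in the usage lines are assumptions made for illustration.

```python
def sego_evaluate(x, power, charge, trust, L, W, B, T):
    """Return (E, feasible) for a candidate allocation x under Algorithm 3."""
    # Steps (2)-(4): sites below the minimum trust T must receive no load.
    if any(xi > 0 and ti < T for xi, ti in zip(x, trust)):
        return 0.0, False
    work = [xi * pi * L for xi, pi in zip(x, power)]        # W_i = x_i P_i L
    cost = sum(wi * ci for wi, ci in zip(work, charge))     # denominator of E
    trusted = sum(wi * ti for wi, ti in zip(work, trust))   # numerator of E
    feasible = (sum(work) >= W and cost <= B
                and all(0.0 <= xi <= 1.0 for xi in x))
    E = trusted / cost if cost > 0 else 0.0
    return E, feasible

if __name__ == "__main__":
    # Two hypothetical sites: all of site 1 and half of site 2 are allocated.
    print(sego_evaluate([1.0, 0.5], [2.0, 4.0], [1.0, 2.0], [0.8, 0.5],
                        L=10, W=30, B=100, T=0.4))
```

In this example both sites supply 20 units of work, the charge is 60 against a budget of 100, and E = 26/60. Raising the minimum trust T to 0.6 disqualifies the second site, so the same allocation becomes infeasible.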



Algorithm 4 specifies the trust integration process, in which n jobs are mapped to
m sites. The trust vectors are propagated and integrated periodically. If a job is
submitted to R_j, this site is responsible for dispatching the workload to all sites and for
monitoring the job execution. Once a job is finished, the occupied resources are released
for other jobs. User applications can resubmit their jobs if the earlier execution was
unsuccessful. The trust integration process includes trust update (Algorithm 1), trust
propagation (Algorithm 2), and SeGO optimization (Algorithm 3). The inputs to this
algorithm are the jobs submitted at all sites. The output is the trusted resource allocation
and the updated trust vectors.



Algorithm 4: Trust integration for optimized resource allocation

Input: n jobs submitted at m resource sites.

Output: Resource allocation for jobs and updated trust vectors for all sites.

(1) Do until (all submitted jobs are executed)

(2)     if (τ = arrival time of current Job = (W, D, T, B))

(3)         Job is put in the job queue of R_j;

(4)         (W_1, W_2, ..., W_m, L) <- SeGO(R_j, Job);

(5)         for i = 1 to m: resource reservation, i.e., P_i = P_i - W_i/L;

(6)     end if

(7)     if (R_j gets the previous Job = (W, D, T, B) report at time τ)

(8)         for i = 1 to m

(9)             resource release, i.e., P_i = P_i + W_i/L;

(10)            if (R_j accumulates index_TTL job reports from R_i)

(11)                Trust_Update(index_TTL reports, i, j);

(12)            if (R_j accumulates execution reports for vector_TTL jobs)

(13)                Trust_Propagation(j);

(14)        end for

(15)    end if

(16) end do

6 Simulation Results on Trusted Grid Resource Allocation

We have developed a discrete-event simulator at USC to simulate the trust integration
and resource optimization processes. We simulated n = 300 jobs running on m = 6
Grid resource sites. Each resource site is configured with a computing power set
randomly between 1 Tflop/s and 5 Tflop/s. Each site is configured with a site
reliability and intrusion defense capability in the range (0, 1).

Jobs are mapped evenly across sites, and all job arrivals are modeled by a Poisson
distribution with an inter-arrival time of 10 minutes. The job workload demand varies
between 4 Tflop and 50 Tflop. The deadline varies between 4 minutes and 20 minutes
after the job is submitted. The minimum trust is set randomly in the range (0.4, 0.7).
Both the resource unit service charge and the user application budget limitations are set
randomly between $180K/Tflop and $320K/Tflop.

Figure 6 depicts the variation of the trust index values at 6 resource sites, R_1
through R_6. The initial trust index values at step 0 vary from 0.07 to 0.77 for sites R_1
through R_6 along the y-axis. The x-axis represents the trust integration step taken
during simulation runs. The average trust index at each site increases steadily after
each step. Through the process, all trust indices grow to the range (0.7, 0.93) at step 5.

In the best case, the lowest index value of 0.07 at site R_1 increases to 0.7 in 5 steps.
This corresponds to a security enhancement of 90% = (0.7 - 0.07)/0.7 for site R_1. In
the worst case for site R_6, the trust index is upgraded from 0.7 to 0.93 in 5 steps.
There is a normalization effect of the trust integration process, which brings the
security levels of all sites to almost the same high level.




Fig. 6. Variation of the trust indices of six resource sites after five trust integration steps

We present in Fig. 7 and Fig. 8 two scatter plots of the performance/cost ratio. The
two scatter plots result from running the SeGO simulation under different trust
management policies. Each triangle represents the performance/cost ratio of one job.
Both figures plot the Grid performance under limited budget with initial trust values
ranging from 0.07 to 0.77 given at step 0 in Fig. 6.

Figure 7 depicts the performance/cost ratio E of 300 jobs with fixed trust, meaning no
security upgrade over the resource sites. Figure 8 plots E with trust integration to
upgrade the defense capabilities at six resource sites. We observed two job groups in
these plots. One group consists of the jobs dropped due to a shortage of resources before
the deadline expired. Those jobs are represented with E = 0 along the x-axis. The
second job group contains the successfully executed jobs. There are 76 dropped jobs in
Fig. 7 and 18 dropped jobs in Fig. 8 out of 300 jobs simulated. This translates to a job
drop rate of 76/300 = 25.3% in Fig. 7 and 18/300 = 6% in Fig. 8.

In Fig. 7, the E-plot for successful jobs varies from 1.67 to 2.71 Tflop/$1M with an
average E = 2.27 Tflop/$1M. In Fig. 8, the successful jobs achieve E = 1.67 to 3.57
Tflop/$1M with an average E = 2.92 Tflop/$1M. Overall, the scatter plot in Fig. 7
shows almost no increasing trend as more jobs are submitted. However, the E-plot in
Fig. 8 increases steadily as more jobs are submitted. Considering the last 50 jobs, we
achieved E = 2.94 to 3.57 Tflop/$1M in Fig. 8. We observe an improvement factor of
114% = (3.57 - 1.67)/1.67 for the best-case scenario.



[Figure 7: scatter plot of the performance/cost ratio E versus job number]

Fig. 7. Grid performance/cost ratio for 300 jobs allocated to six resource sites with fixed trust
index and no site security reinforcement

Fig. 8. Improved Grid performance/cost ratio for 300 jobs allocated to 6 resource sites after
trust integration and security upgrade



In summary, our trusted resource allocation (Fig. 8) shows a 76% - 114%
improvement in Grid performance/cost ratio E. The job drop rate is reduced by
(76 - 18)/76 = 76% in favor of the trust integration solution. On the average E, a
performance gain of 28.6% = (2.92 - 2.27)/2.27 resulted from trusted resource allocation.
In fact, the trust-integration process is at work very early on. After the
submission of the first 15 jobs, E starts to climb, and it exceeds 3.0
Tflop/$1M at the 100th job. The results clearly demonstrate the effectiveness of trust
integration. Trusted Grid sites accommodated 94% = 1 - 6% of the 300 user jobs.
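
The percentages quoted in this section follow directly from the reported job counts and E values. The short Python check below recomputes them to one decimal place; the variable names are illustrative and the inputs are the figures stated in the text.

```python
def pct(ratio):
    """Express a ratio as a percentage rounded to one decimal place."""
    return round(100.0 * ratio, 1)

JOBS = 300
DROPPED_FIXED, DROPPED_INTEGRATED = 76, 18   # dropped jobs in Fig. 7 and Fig. 8

drop_rate_fixed = pct(DROPPED_FIXED / JOBS)
drop_rate_integrated = pct(DROPPED_INTEGRATED / JOBS)
drop_reduction = pct((DROPPED_FIXED - DROPPED_INTEGRATED) / DROPPED_FIXED)

# E values in Tflop/$1M: baseline minimum 1.67, last-50-jobs range 2.94 to 3.57.
gain_low = pct((2.94 - 1.67) / 1.67)
gain_high = pct((3.57 - 1.67) / 1.67)
gain_average = pct((2.92 - 2.27) / 2.27)     # average E with vs. without integration

if __name__ == "__main__":
    print(drop_rate_fixed, drop_rate_integrated, drop_reduction)
    print(gain_low, gain_high, gain_average)
```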

The utilization rate is defined as the percentage of allocated resources among all available
resources. The utilization rate for resources with fixed trust values stays at a roughly
constant level of about 40% during the simulation runs (34.9% to 45.1% in Table 1). The
utilization rate for resources with integrated trust values varies from a low of 48.1% to a
high of 92.6%. The utilization of Grid resources increases as more jobs are submitted. These
results demonstrate a significant gain in Grid performance through optimized resource
allocation and aggressive security reinforcement by trust integration.



Table 1. Utilization of Grid Resources at Six Sites for the Execution of 300 Jobs

Grid resource utilization rate, by job number

Job number               1-50    51-100   101-150   151-200   201-250   251-300
With fixed trust         39.4%   45.1%    43.0%     34.9%     45.1%     38.4%
With trust integration   48.1%   78.0%    65.4%     84.2%     92.6%     82.9%




7 Conclusions and Suggestions for Further Research

This work offers a first step towards trusted Grid computing. Several recent
reports from the USC Internet and Grid Computing Laboratory give
comprehensive treatments of the GridSec architecture [9], Internet traffic datamining
for automated intrusion detection [13], and trusted Grid resource allocation [14]. We
summarize below our research findings and make a few suggestions for further research.

• Fuzzy trust integration reduces platform vulnerability and guides the defense 
deployment across Grid sites. Our VPN-supported trust integration is meant to 
enforce security in Grids beyond the use of PKI services [2, 9, 14]. 
Comprehensive simulation results were reported in [14] to prove the 
effectiveness of the SeGO scheduler for trusted resource allocation in 
computational Grids. 

• Self-defense toolkits are needed to secure Grid computing [9]. We have 
suggested the use of distributed firewalls, packet filters, virtual private networks, 
and intrusion detection systems at Grid sites. A new anomaly-based, intrusion 
detection system was developed with datamining of frequent traffic episodes in 
TCP, UDP, and ICMP connections as reported in [13]. 

• Regarding future research directions, we suggest integrating the SeGO scheduler
with other Grid job/resource management toolkits such as Globus/GRAM,
AppLeS, and Nimrod/G [2, 4]. Grid security policies and Grid operating systems
are needed to establish a truly secure Grid computing environment [15].

Abstract. Atomic Commitment Protocol (ACP) is an important part of any
distributed transaction. ACPs have been proposed for homogeneous and
heterogeneous distributed database management systems (DBMS). ACPs
designed for these DBMS do not meet the requirements of Grid databases.
Homogeneous DBMS are synchronous and tightly coupled, while
heterogeneous DBMS, like multidatabase systems, require a top layer of
multidatabase management system to manage distributed transactions. These
ACPs either become too restrictive or need some changes in the participating
DBMS, which may not be acceptable in a Grid environment. In this paper we
identify requirements for Grid database systems and then propose an ACP for
Grid databases, the Grid-Atomic Commitment Protocol (Grid-ACP).



1 Introduction 

Atomic commitment is one of the important requirements for transactions executing
in distributed environments. Among the Atomicity, Consistency, Isolation and Durability
(ACID) [4] properties of a transaction, Atomic Commitment Protocols (ACPs)
preserve the atomicity of transactions running in a distributed environment. Two-phase
Commit (2PC) and its variants are widely accepted as ACPs for transactions running in
distributed data repositories [2,3,14]. These data repositories are considered to be
homogeneous, tightly integrated and synchronous.

Grid infrastructure [7,8], a new and evolving computing infrastructure, promises to
support collaborative, autonomously evolved, heterogeneous, data-intensive
applications. Grid databases would access distributed resources in general and
distributed data repositories in particular. Thus, protocols developed for a
homogeneous distributed architecture will not work in the Grid infrastructure. Hence
classical approaches to data management need to be revisited to address the challenges of
Grid databases.

Transaction management is critical in any data-based application, be it a simple file
management system or a structured Database Management System (DBMS).
Transaction management is responsible for managing concurrency control and reliability
protocols. Many applications will not need transactional support, i.e. ACID
properties, while executing on Grids, e.g. Business Activities [12]. Our earlier work
was focused on concurrency control in the Grid environment [17]. In this paper we
particularly focus on ACP in the Grid environment.

H. Jin et al. (Eds.): NPC 2004, LNCS 3222, pp. 22-29, 2004.
© IFIP International Federation for Information Processing 2004

Grid databases [1,2] are expected to store large amounts of data from scientific
experiments, viz. astronomical analysis, high-energy physics [16], weather
forecasting, earth movement etc. These experiments generate huge volumes of data
daily. Particle physics experiments, e.g. Babar, may need to store up to 500 GB of
data each day; Babar is arguably the world’s largest database, storing approx. 895 TB of
data as of today (Mar ‘04) [15]. The wider research community is interested in generic
data collected at various data collecting sites [1,10,13,15]. Distributed access to data
raises many issues like security, integrity constraints, manageability, accounting,
replication etc. Here, however, we are mainly concerned with managing transactions
in Grids and their requirement of atomic commitment. In this paper, unless otherwise
stated, distributed database is used in a broader sense to cover distributed/federated/
multidatabase systems, since all of these access data located at physically distributed sites.

The remainder of the paper is organized as follows. Section 2 explains the
background work in distributed DBMS. Section 3 explains the working model and
identifies the problem in applying existing ACPs in the Grid model. We propose the
Grid-ACP to meet the Grid requirements for ACPs in Section 4, along with a proof of
correctness of the protocol. Section 5 concludes the work and outlines future work.



2 Background 

Atomic commitment is an important requirement of transactions running in a
distributed environment. All cohorts of a distributed transaction should either commit or
abort to maintain the atomicity property of the transaction and consequently
maintain the correctness of the stored data. We broadly classify distributed DB systems
into two categories: (a) homogeneous and (b) heterogeneous distributed DBMS.
A detailed classification can be found in [14].



2.1 Homogeneous Distributed Database 

2PC [4] is the simplest and most popular ACP proposed in the literature to achieve
atomicity in homogeneous DBMS [3,14]. We briefly describe 2PC as background for the
discussion that follows. The site where the transaction originates acts as the
coordinator for that transaction; all other sites where data is accessed are
participants. 2PC works as follows [4]:

The coordinator sends a vote_request to all the participating sites. After receiving the
request, a site responds by sending its vote, yes or no. If the participant voted yes, it
enters the prepared (or ready) state and waits for the final decision from the coordinator.
If the vote was no, the participant can abort its part of the transaction. After collecting
all the votes, if all of them, including the coordinator’s own vote, are yes, then the
coordinator decides to commit and sends the message accordingly to all the sites. If even
one of the votes is no, the coordinator decides to abort the whole transaction. After
receiving the commit or abort decision from the coordinator, the participant commits or
aborts accordingly from the prepared state. While the participant is in the prepared state it
is uncertain of the final decision from the coordinator and cannot proceed on its own.
Hence 2PC is called a blocking protocol.
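
The two phases described above can be sketched in a few lines of Python. This is a failure-free illustration only: it models votes as booleans and does not cover timeouts, logging, or recovery, which is where the blocking behaviour actually shows up. The function and state names are assumptions chosen for the example.

```python
def two_phase_commit(coordinator_vote, participant_votes):
    """Return (decision, final_states) for one 2PC run without failures."""
    # Phase 1 (voting): yes-voters enter the prepared state, no-voters abort.
    states = {p: ("prepared" if vote else "aborted")
              for p, vote in participant_votes.items()}
    # Phase 2 (decision): commit only if every vote, including the
    # coordinator's own, is yes.
    commit = coordinator_vote and all(participant_votes.values())
    decision = "commit" if commit else "abort"
    for p, state in states.items():
        if state == "prepared":          # prepared sites follow the decision
            states[p] = "committed" if commit else "aborted"
    return decision, states

if __name__ == "__main__":
    print(two_phase_commit(True, {"s1": True, "s2": True}))
    print(two_phase_commit(True, {"s1": True, "s2": False}))
```

In the second call a single no vote forces the prepared site s1 to abort as well, which is the all-or-nothing outcome that atomic commitment requires.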



2.2 Heterogeneous Distributed Database 

Multidatabase systems assume a heterogeneous environment [5,9] for transaction
execution. They typically execute a top layer of multidatabase management system
for transaction management. These systems are designed for certain application-specific
requirements and mostly for short and synchronous transactions. Due to the high
autonomy (design and execution) requirements in multidatabase systems, the ACPs
are not designed for replicated data. Thus these protocols are not suitable for the Grid
environment. The literature [9] discusses the following major strategies for atomic
commitment of distributed transactions in a heterogeneous database environment: (1)
Redo, (2) Retry, and (3) Compensate.

Not all sites may support a prepare-to-commit state. Thus, even if the global
transaction decides to commit, some local subtransactions may decide to abort while
others decide to commit. Hence, subtransactions that decided to abort must redo the
write operations, and commit, to reach a consistent global decision [9]. Another
approach to the above problem is the retry approach, as discussed in [9]. In the
retry approach, the whole subtransaction is retried rather than redoing only the write
operations. An inherent limitation of this approach is that the subtransaction must be
retriable. A subtransaction is retriable only if the top layer of the multidatabase system
has saved the execution state of the aborted subtransaction. If the global decision is to
abort and any local subtransaction has already committed, then compensating
transactions can be executed [9]. Compensating transactions also need to access
information stored in global DBMSs.



3 Grid Database Model and Problem Identification 

In this section we first discuss the general model and terminology that we use in our 
study. Then we discuss the problem in implementing standard ACPs in this model. 



3.1 Model 

The Grid middleware will join geographically separate computing and data resources.
The concept of a virtual organization (VO) [7] has been coined for integrating
organizations over a network. Grid infrastructure is expected to support and make use
of web services for specialized purposes. We focus on collaborative, data-intensive
work that needs to access data from geographically separated sites. The
general model is shown below:

Fig. 1. General model of Grid database system



All individual database systems are autonomously evolved and hence
heterogeneous in nature. These database systems may join and leave the Grid at
their convenience. A transaction is termed a global transaction if it originates at any
site and needs to access data from other sites; in other words, a transaction that has to
access data from more than one site is a global transaction. The parts of a
global transaction executed at individual sites are called subtransactions.



3.2 Problem Identification 

2PC is the most widely accepted ACP in distributed databases. 2PC is a consensus-based
protocol that asks all the participating sites to vote on whether the subtransactions
running at those sites can commit. After collecting and analyzing all votes, the
coordinator decides the fate of the distributed transaction. It involves two phases of
communication messages, the voting phase and the decision phase, before terminating the
transaction atomically, hence the name two-phase commit.

Many variations and optimizations have been proposed to increase the
performance of 2PC. But homogeneity between sites is the basic assumption behind
the originally proposed 2PC for distributed databases. Multi/federated database
systems are heterogeneous, but the transactions and applications for which these
heterogeneous database systems are studied, designed and optimized are very
different from their counterparts in Grid databases; for example, they target short,
synchronized, non-collaborative transactions. These systems have the leverage of a
top-level layer, known as the multidatabase management system, that assists in making
decisions, but Grids may not enjoy this facility due to the distributed nature of the database
systems. Multidatabase systems employ the redo, retry and compensate approaches for ACP.
These approaches may not be implementable in the absence of a top-layer management system
and at the same time may be too restrictive [6]. Grid databases need to operate in a
loosely coupled service-oriented architecture. Apart from the data consistency
perspective, Grid databases will be expected to access data via the WWW [11,12].
Most distributed DBMSs are not designed to operate in a WWW environment.













S. Goel, H. Sharda, and D. Taniar 



4 Proposed Protocol 

As discussed earlier, requirements of Grid DB systems cannot be satisfied by existing 
distributed DBMS. In this section we propose an ACP to meet these requirements. 



4.1 Grid Atomic Commitment Protocol (Grid-ACP) 



Before we proceed with the protocol, we note that executing
compensating transactions does not result in the standard atomicity of a transaction. The
notion is referred to as semantic atomicity [9].

Figure 2 shows the state diagram of the proposed Grid-Atomic Commitment Protocol
(Grid-ACP). We introduce a new state and call it the sleep state. The sub-transaction
enters the sleep state when it finishes execution and is ready to release all acquired
resources. The sleep state is an indication to transaction managers that the local
sub-transaction of the global transaction has committed but is still waiting for the decision
from the originator of the transaction. If any of the other participating sites aborts its
subtransaction, the coordinator informs all the sleeping sites to compensate the
changes made by the transaction.

Fig. 2. State diagram of Grid-ACP



The Grid-ACP algorithm is explained as follows:

1. The transaction originator splits the transaction based on the information at the
Grid-middleware service and submits the parts to the participating database systems.

2. Respecting its autonomy, each participating site executes its sub-transaction
and goes to the sleep state after logging all the necessary
compensating information in stable storage. The site then informs the
originator of the outcome of the sub-transaction execution.

3. The originator, after collecting the responses from all participants, decides
whether to commit or to abort. If all participants decided to go to the sleep state, the
decision is to commit; otherwise the decision is to abort. If the decision is to abort, the
message is sent only to those participants that are in the sleep state. If the decision
is to commit, it is sent to all participants.

4a. If the local site decided to commit and is in the sleep state, and the global decision is
also to commit, the transaction can go directly to the commit state, as everything
went as expected by the local site.

Grid-ACP: Originator’s Algorithm

submit sub-transactions to participants;
wait for response from all participants;
if all responses are sleep then begin
    write commit record in log;
    send global_commit to all participants;
end if
else begin
    write abort record in log;
    send global_abort to participants who decided to commit;
    wait for response from these participants;
end
return



Grid-ACP: Participant’s Algorithm

received sub-transaction from originator
if participant decides to commit then begin
    write sleep in log
    send commit decision to originator
    wait for decision from originator
    if decision is commit then
        write commit in log
    else if decision is abort then begin
        start compensating transaction for this transaction
line 10: if compensating transaction aborts then begin
            restart compensating transaction until it commits
            write commit for compensating transaction
        end if
        else
            write commit for compensating transaction
    end
    end if
end if
else if participant decides to abort then begin
    write abort in log
    send abort decision to originator
end if
return



4b. If the local site decided to commit and is in the sleep state but the global decision is
to abort the transaction, then the local transaction must be aborted. But, as
mentioned earlier, when the local site enters the sleep state it releases all locks on
data items as well as all acquired resources. This makes aborting the transaction
impossible. Hence, a compensating transaction must be executed to revert all the
changes, using compensation rules, and so restore the semantics of the database before
the original subtransaction was executed, thus achieving semantic atomicity. If the
compensating transaction fails, it is resubmitted. We do not define the
compensation rules, as they are out of the scope of this paper.
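
Steps 1 to 4b can be put together in a small failure-free Python sketch. It models only the decisions and state transitions: logging, messaging, and the compensation rules themselves are left out, and the callback `attempt_succeeds`, which stands in for one run of a compensating transaction, is a hypothetical helper introduced for the example.

```python
def participant(decides_commit):
    """Local execution: a committing site enters 'sleep', others abort."""
    return "sleep" if decides_commit else "aborted"

def compensate(attempt_succeeds):
    """Re-run the compensating transaction until it commits; count attempts."""
    attempts = 1
    while not attempt_succeeds():
        attempts += 1
    return attempts

def grid_acp(local_decisions, attempt_succeeds=lambda: True):
    """Return (global decision, final state per site) for one transaction."""
    states = {site: participant(d) for site, d in local_decisions.items()}
    if all(s == "sleep" for s in states.values()):
        # Step 4a: every site slept, so each one simply marks commit.
        return "commit", {site: "committed" for site in states}
    # Step 4b: only sleeping sites are notified, and they must compensate.
    for site, s in states.items():
        if s == "sleep":
            compensate(attempt_succeeds)
            states[site] = "compensated"
    return "abort", states

if __name__ == "__main__":
    print(grid_acp({"a": True, "b": True}))
    print(grid_acp({"a": True, "b": False}))
```

In the second call site a has already released its resources in the sleep state, so it ends in a compensated state rather than a plain abort, which corresponds to semantic atomicity.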

Maintaining the autonomy of local sites is of primary importance in the Grid environment.
Different sites may therefore employ different protocols for serializability as well. Some
sites may employ locking protocols while others may employ timestamping or an
optimistic concurrency control strategy. Thus, in the presence of such an
autonomous and heterogeneous environment in Grids, and in the absence of a top-layer
management system, it may be impossible to avoid cascading aborts. The proposed
sleep state restricts the number of cascading aborts. We would also like to highlight
that the sleep state does not interfere with the autonomy of the local sites.
Implementing this state does not need any modification of the local transaction manager
module. Whenever a site decides to join the Grid, the sleep state may be defined in
the interface, and hence no changes are required in any local modules.

We briefly discuss the time and message complexity of the proposed algorithm.
Grid-ACP needs 2 rounds (time complexity) of messages under normal conditions: (1)
the responses after the local sites decide to commit/abort, and (2) the decision from the
originator. The maximum number of messages required to reach a
consistent decision under normal conditions, i.e. without failure, is 2n (message
complexity), where n is the number of participants in the ACP. Considering that the
originator sends the final decision to all the sites, the number of messages in each round is n.
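
The message count can be checked with a short Python sketch. It counts only the two rounds named above, so acknowledgements sent after compensation are deliberately not included; the function name and the boolean encoding of local decisions are assumptions made for the example.

```python
def grid_acp_message_count(local_decisions):
    """Count failure-free messages: one response per site, then the decisions."""
    n = len(local_decisions)
    round_one = n                          # each site reports its outcome
    if all(local_decisions):
        round_two = n                      # global_commit goes to every site
    else:
        round_two = sum(local_decisions)   # global_abort goes to sleeping sites only
    return round_one + round_two

if __name__ == "__main__":
    print(grid_acp_message_count([True, True, True, True]))
    print(grid_acp_message_count([True, True, False, True]))
```

The commit case reaches the 2n maximum, while an abort needs fewer messages because sites that aborted unilaterally are not notified.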



4.2 Correctness of Proposed Protocol 

We show the eorreetness of our ACP by following lemma: 

Lemma 1: All participating sites reach the same final decision. 

Proof: We prove this lemma in two parts, part-I for consistent commit and part-II 
for consistent abort. 

Part I: In this part we show that when the global decision is to commit, all 
participants commit. From step-2 of the algorithm it is clear that the participants 
execute autonomously. If the local decision is to commit, the information is logged in 
stable storage and the subtransaction goes into the sleep state after sending a message 
to the originator. If the originator of the transaction finds only commit decisions in 
the responses, it sends the final commit to all participants. In this case the 
participant is not required to take any action, as all resources were already released 
when the participant entered the sleep state. The participant just has to mark the 
migration of state from sleep to commit. 

Part II: The participants have to do more to achieve this part. In this part we show 
that if the global decision is to abort, all participants abort. All participants 
that decided to commit now receive the abort decision from the originator. Those 
participants that decided to abort have already aborted unilaterally. Those 
subtransactions that decided to commit have already released locks on data items and 
cannot be aborted. Hence, compensating transactions are constructed using the 
event-condition-action or the compensation rules. These compensating transactions are 
then executed to achieve semantic atomicity (step-4b of the algorithm). To achieve 
semantic atomicity the compensating transaction must commit. If the compensating 
transaction aborts for some reason it is re-executed until it commits. The 
compensating transaction has to commit eventually, as it is a logical inverse of a 
committed transaction. This is shown in the state diagram by the self-referring 
compensate state and line-10 of the participant’s algorithm. Though the compensating 
transaction commits, the semantics of the subtransaction is abort. Thus all participants 
terminate with a consistent decision. ■ 
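
The two parts of the proof can be traced with a small simulation. This is a minimal 
sketch, not the authors' algorithm listing: the class and function names are 
hypothetical, logging to stable storage is omitted, and the compensating transaction 
is modelled as a call that fails at random, purely to exercise the retry loop. 

```python
import random
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    SLEEP = "sleep"            # voted commit, resources already released
    COMPENSATE = "compensate"  # undoing a locally committed subtransaction
    COMMIT = "commit"
    ABORT = "abort"


class Participant:
    def __init__(self, name, votes_commit, rng=None):
        self.name = name
        self.votes_commit = votes_commit
        self.state = State.ACTIVE
        self.rng = rng or random.Random(0)

    def local_decision(self):
        """Decide autonomously and report the vote to the originator."""
        # Commit voters move to SLEEP; abort voters abort unilaterally.
        self.state = State.SLEEP if self.votes_commit else State.ABORT
        return self.votes_commit

    def run_compensation(self):
        """Hypothetical compensating transaction; fails half the time here."""
        return self.rng.random() < 0.5

    def global_decision(self, commit):
        if self.state is State.ABORT:
            return  # already aborted unilaterally
        if commit:
            self.state = State.COMMIT  # only the state change is recorded
            return
        self.state = State.COMPENSATE
        while not self.run_compensation():
            pass  # re-execute until the compensating transaction commits
        self.state = State.ABORT  # semantically aborted


def run_protocol(participants):
    votes = [p.local_decision() for p in participants]  # round 1
    decision = all(votes)
    for p in participants:                              # round 2
        p.global_decision(decision)
    return decision
```

With all participants voting commit, every participant ends in the commit state; 
with a single abort vote, every participant ends in the abort state, the sleeping 
ones by way of the compensate state. 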

5 Conclusion 

We have seen that ACPs proposed for homogeneous DBMSs, e.g. 2PC, are not suitable for 
autonomously evolved heterogeneous Grid databases. Strategies for traditional 
heterogeneous DBMSs, like multidatabase management systems, are too restrictive and 
need a global management system. We have proposed an ACP to meet Grid database 
requirements that uses a sleep state for participating sites. The proposed sleep state will 
also help in putting a cap on the number of aborting transactions. We also 
demonstrated the correctness of the proposed protocol. In future we intend to quantify 
and optimize the capping values of the protocol. 

Abstract. A data grid system supports uniform and secure access to heterogeneous 
distributed data resources across a range of administrative domains, each 
with its own local security policy. Security has therefore been a central 
challenge in data grid environments. This paper presents GridDaEn's security 
mechanisms. In addition to basic authentication and authorization functionality, 
GridDaEn provides an integrated security strategy whose key feature is shared 
context-based secure channel building, which reduces security processing cost 
and so improves the performance of interactions among multiple domains. 
Meanwhile, single sign-on across multiple domains is achieved by means of proxy 
credentials. Experiments show that this approach can guarantee system security 
and reliability with a substantial performance improvement. 



1 Introduction 

A data grid system integrates and manages heterogeneous distributed data resources 
across a range of administrative domains, each with its own local security policy. 
While such a system is convenient for users, it also poses a major security challenge. 

GridDaEn [1] (Grid Data Engine) is a data grid middleware implemented by 
NUDT (National University of Defense Technology). The system faces several 
well-known security problems of data grid environments, such as integration, 
interoperability and trust problems, which greatly complicate its security 
mechanisms. According to our grid application background, we adopt GSI [2] (Globus 
Security Infrastructure) as a basic framework and improve it with the introduction 
of shared contexts. GridDaEn’s security mechanism is built in combination with 
PKI [3] (Public Key Infrastructure), with the following features: 

• Supporting mutual authentication and communication confidentiality; 

• Supporting a fine-grained RBAC authorization mechanism; 

• Supporting single sign-on; 

• Supporting shared context-based secure channel building to improve 
performance. 

The rest of the paper is organized as follows: Section 2 presents a brief introduction 
to the GridDaEn system. Section 3 describes the structure and features of the GridDaEn 
security system in detail. Section 4 shows the implementation. Section 5 exhibits the 
performance improvement by comparing the time overheads of building secure 
channels with two different approaches. Section 6 introduces some related work on grid 
security. Finally, in Section 7 we present our conclusions and future work. 



2 GridDaEn System Overview 

2.1 GridDaEn Structure Model 

The GridDaEn (Grid Data Engine) system is a data grid middleware that integrates 
various kinds of file systems and provides uniform, seamless access to distributed 
datasets. GridDaEn consists of four major components: Client tools, Security and 
System manager, DRB (Data Request Broker) servers, and MDIS (Metadata Information 
Service), as illustrated in Figure 1. 

Fig. 1. The structure of GridDaEn 



Fig. 2. GridDaEn structure with security mechanism 



There is more than one administrative domain in GridDaEn. In each domain, 
there is a DRB server, which performs the actual data operations on local storage 
resources in response to requests from users and applications. MDIS, which provides 
the metadata service for each DRB server, is organized as a distributed structure 
comprising several local metadata servers and a global metadata server. Security 
information, such as authorization information, is partly stored in MDIS. 

2.2 A Job-Flow Across Multi-domains in GridDaEn 

In order to illustrate the mechanisms implemented in GridDaEn, we first analyze 
a typical job flow scenario across multiple domains in GridDaEn, as shown 
in Figure 3. 

• A user contacts DRB A, and submits a job to DRB A 

• DRB A contacts its MDIS M, in cooperation with which it checks the user's 
rights 

• DRB A locates the required data. If it is located in the local domain, DRB A 
contacts site S where the data resides, and then obtains the data from site S. 
Otherwise, DRB A asks MDIS M where the data resides, and then finds its 
broker, for example, DRB B 

• DRB A then contacts DRB B, and delivers the original job to DRB B 

• DRB B contacts its MDIS N, checks the user's rights, and then locates the 
resource site T 

• DRB B contacts site T, and obtains the required data 

• DRB B returns the data to the user via DRB A 
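
The steps above can be sketched as a routing function that lists which pairs of 
participants contact each other for one job. This is only an illustration: the 
catalogue contents, the file names and the function job_flow are hypothetical, and 
the rights checks are reduced to a single contact with the MDIS. 

```python
# Hypothetical catalogue: data name -> (domain, resource site).
CATALOGUE = {"report.dat": ("A", "S"), "survey.dat": ("B", "T")}
MDIS_OF = {"A": "MDIS M", "B": "MDIS N"}


def job_flow(data, home="A"):
    """Return the ordered list of (caller, callee) contacts for one job."""
    domain, site = CATALOGUE[data]
    home_drb = f"DRB {home}"
    hops = [("user", home_drb),           # the user submits the job
            (home_drb, MDIS_OF[home])]    # rights check and data lookup
    if domain == home:
        hops.append((home_drb, f"site {site}"))   # data is in the local domain
    else:
        remote_drb = f"DRB {domain}"
        hops += [(home_drb, remote_drb),          # the job is forwarded
                 (remote_drb, MDIS_OF[domain]),   # remote rights check
                 (remote_drb, f"site {site}")]    # the data is fetched
    return hops
```

In this sketch a job on local data involves three distinct pairs of participants, 
and a job on data in another domain involves five; the result travels back along 
the same path. 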




Fig. 3. Job workflow 



3 GridDaEn Security Mechanisms 

3.1 GridDaEn Security Structure 

As mentioned before, GridDaEn needs security functionalities such as 
authentication, authorization, and communication confidentiality. To meet these 
requirements, GridDaEn provides its own CA (Certificate Authority), mutual 
authentication, authorization, and communication confidentiality. In addition, 
single sign-on is realized by means of proxy credentials. The security structure of 
GridDaEn is illustrated in Figure 2. 

In Figure 2, there is a CA located at the site where the GridDaEn Global MDIS resides. 
GridDaEn entities, such as users, DRBs, MDISes and resource sites, must request 
and obtain a digital certificate from the CA to identify themselves. 



3.2 Features of GridDaEn Security System 

As can be seen from Figure 3 in Section 2.2, to guarantee security, any two 
participants must authenticate each other and build a secure channel before the job can 
start. If the required data is within DRB A's domain, the system must 
build three secure channels (as denoted by the solid lines). Otherwise, five channels 
(as denoted by the dashed lines) must be built. If the job is more complex, 
the number of secure channels will be even larger. As a result, a large amount of time 
is consumed in building secure channels. For performance reasons, it 
is necessary to minimize the number of secure channels that are built. Therefore, shared 
context-based secure channels are introduced as a solution. Besides this distinctive 
feature, other security mechanisms, such as single sign-on and RBAC authorization, 
are also supported. 



3.2.1 Shared Context-Based Secure Channels 

From the above analysis, we can infer that the performance of secure channels 
greatly affects the performance of the whole system. Our solution is what we call a 
context mechanism, similar to a connection pool, which reduces the overheads 
caused by security authentication by sharing or reusing contexts. Such a context is 
built for the secure channel between any two participants in GridDaEn. When a DRB 
is started, it automatically starts a mutual authentication process with an MDIS to 
build a secure channel. After that, the information associated with this process is 
saved in a context, which has many properties and methods, as sketched in Figure 4. 
Note that, because of the dynamic and changeable nature of the grid environment, each 
context should have a lifetime field to specify its validity period. Within the specified 
lifetime, all the data transferred between this DRB and its MDIS is encrypted and passed 
through this context, and access rights are also authorized by the identity recorded in 
the context. When a user wants to access data residing at a site, a context between 
this user and the DRB for this site is also established. 

Fig. 5. A context table entry 

Different users cannot share the contexts between users and DRBs; that is, 
mutual authentication must be performed again when a user wants to contact another 
DRB. However, the contexts between a DRB and an MDIS, between a DRB and a 
resource site, or between one DRB and another DRB can easily be shared or reused by 
others. Therefore, overheads are lowered and performance is improved. 

However, using and managing these contexts can be a troublesome problem. 
Here we adopt the idea of a routing table. Once the system builds a context, it is 
classified by its creator and saved into a table with its lifetime. The address 
information of the two participants is used to index this table. Figure 5 gives an 
example of a context table entry. Subsequent requests can therefore quickly find a 
context, which greatly accelerates their authorization process. 
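
A context table of this kind can be sketched as a cache keyed by the two 
participants' addresses, with an expiry time per entry. This is a simplified 
illustration rather than GridDaEn's Java implementation: the class ContextTable, 
its method names, the default lifetime and the assumption that a channel is 
direction-independent are all hypothetical. 

```python
import time


class ContextTable:
    """Toy context table indexed by the addresses of the two participants."""

    def __init__(self, authenticate, clock=time.monotonic):
        self._authenticate = authenticate  # expensive mutual authentication
        self._clock = clock
        self._entries = {}                 # (addr_a, addr_b) -> (context, expiry)

    def get(self, addr_a, addr_b, lifetime=60.0):
        # Assumption: a channel has no direction, so the key is order-free.
        key = tuple(sorted((addr_a, addr_b)))
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and now < entry[1]:
            return entry[0]                # reuse a context that is still valid
        context = self._authenticate(addr_a, addr_b)
        self._entries[key] = (context, now + lifetime)
        return context
```

Only the first request between two participants, or the first one after the 
lifetime has expired, pays for mutual authentication; every other request reuses 
the stored context. 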



3.2.2 Single Sign-On and Authorization 

GridDaEn implements single sign-on by means of the user's proxy credentials. Before 
a user authenticates with a DRB, he generates a proxy credential from his digital 
certificate. He then authenticates himself to the DRB with this proxy. Afterwards, the 
proxy carries out all the activities that the user wants to perform at all the sites, 
acting as a full representative of his identity. 

For authorization in the GridDaEn security system, we borrow some ideas from RBAC. 
The basic concept of RBAC [4] is that users are assigned roles associated 
with specific permissions, and a user's permissions are the combination of the 
permissions of these roles. We introduce this flexible and effective approach into our 
authorization mechanism. 

We have developed several tools for permission definition, role definition, and user 
definition. First, we define several elementary permissions for resources in our 
authorization mechanism. Then we define roles for each domain separately, assigning 
the corresponding permissions to them. Note that these roles only work in their own 
domains. Finally, we create grid users and assign roles to users according to some 
policy. If a user is assigned roles of a domain, he can exercise the corresponding 
privileges in this domain. Without roles of that domain, however, he cannot do 
anything there. Thus, fine-grained authorization is obtained. In order to simplify 
authorization operations, we also introduce the notion of a group, to which roles can be 
assigned. A user belonging to a group automatically inherits all the roles assigned to 
the group. All authorization information is saved in MDIS. Once authenticated, a user is 
checked for the specific privileges needed to complete his job. After 
that, the job is run under the identity of a local user by means of local user-mapping. 
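
The domain-scoped roles and group inheritance described above can be sketched as 
follows. The class Authorizer and its methods are hypothetical names chosen for 
illustration; storage in MDIS and local user-mapping are not modelled. 

```python
from collections import defaultdict


class Authorizer:
    """Per-domain roles, with optional groups that carry roles."""

    def __init__(self):
        self.role_perms = defaultdict(set)    # (domain, role) -> permissions
        self.user_roles = defaultdict(set)    # user -> {(domain, role)}
        self.group_roles = defaultdict(set)   # group -> {(domain, role)}
        self.user_groups = defaultdict(set)   # user -> groups

    def define_role(self, domain, role, permissions):
        self.role_perms[(domain, role)] |= set(permissions)

    def assign_role(self, user, domain, role):
        self.user_roles[user].add((domain, role))

    def assign_group_role(self, group, domain, role):
        self.group_roles[group].add((domain, role))

    def add_to_group(self, user, group):
        self.user_groups[user].add(group)

    def permissions(self, user, domain):
        roles = set(self.user_roles[user])
        for group in self.user_groups[user]:
            roles |= self.group_roles[group]   # roles inherited from the group
        perms = set()
        for role_domain, role in roles:
            if role_domain == domain:          # roles only work in their own domain
                perms |= self.role_perms[(role_domain, role)]
        return perms

    def is_allowed(self, user, domain, permission):
        return permission in self.permissions(user, domain)
```

A user holding a role in domain A gets that role's permissions in A and nothing in 
domain B, while a user added to a group picks up the group's roles without any 
per-user assignment. 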



4 Implementation 

The GridDaEn security system is implemented in Java, so it can be installed on 
various platforms, such as Linux and Windows, without modification. We employ 
some functionalities of the Java CoG Kit [5], such as authentication and proxy 
credential generation, with some improvements made to meet our specific requirements. 

We provide a GridDaEn CA for our purposes, which responds to certification 
requests from GridDaEn entities, as also depicted in Figure 2. Two participants 
authenticate each other using their own digital certificates. Another important module 
issues a proxy certificate for each client, by means of which single sign-on can be 
achieved. Direct authentication with a digital certificate is also possible, to 
obtain higher security privileges. After authentication between any two participants, all 
the transferred data is encrypted through their context to guarantee communication 
confidentiality. All interfaces and APIs are standard and are supported by the GSS-API 
(Generic Security Service Application Program Interface). 

Based on the above security functionalities, we build a context table to save the 
contexts built after authentication. It also records the lifetime and other information 
about the contexts. Figure 6 shows a sequence diagram that describes 
how a context is built and shared across multiple domains in GridDaEn. As depicted, 
the context between DRB A and MDIS is built when client 1 contacts DRB A 
and submits a job to DRB A, and it is used to process jobs from client 1, such as 5 and 
6. Within its lifetime it is also used to process jobs from client 2, such as 14 and 15. 
The other contexts are handled similarly. 




Fig. 6. A Sequence Diagram describing how to build and share context 

By adopting the PKI model, our security system can integrate with existing systems 
and technologies, and it can be deployed on all kinds of platforms or hosting 
environments, with distributed structure support. Meanwhile, digital certificates signed by 
the GridDaEn CA help to establish trust relationships among multiple domains in 
GridDaEn through mutual authentication. 

5 Performance 

Since security efficiency is a main concern in our system, we have carried out some 
experiments to test its performance. The test environment is built with a 10Mbps hub 
and four machines. The client runs on an Intel machine with a 2.4GHz CPU, 
512MB of main memory, a 40GB Seagate IDE disk and the Windows 2000 operating 
system. A resource site runs on an identical machine. A DRB runs on a 
Pentium IV machine with a 2.4GHz CPU, 512MB of main memory, a 40GB Seagate 
IDE disk and the Red Hat Linux release 8.0 operating system. An MDIS runs on a 
machine identical to the one running the DRB. 

In one experiment, a client sends a request to the DRB to read a file from some site. 
First, mutual authentication alone is measured; second, the processing of a job 
request, including security authentication and encryption of the transferred data, is 
measured. The time consumed in both cases is illustrated in Figure 7. Note that MDIS 
and Resource-Site each authenticate with the DRB separately, and the DRB additionally 
authenticates with the Client. As can be seen from Figure 7, authentication accounts 
for a large percentage of the whole security processing time (including 
authentication, data encryption/decryption etc.). Therefore, the performance of 
authentication greatly affects the performance of the whole security system. 

Fig. 7. Test results comparing time overheads for authentication processing only and 
whole security processing 

Fig. 8. Test results comparing time overheads with shared context and without 
shared context 



In another experiment, we test the security overhead in DRB, MDIS and Resource-Site 
when processing many jobs from clients. First, many job requests are processed without 
shared contexts; second, the same jobs are processed with shared contexts. Figure 
8 illustrates the time overheads of both. Note that, with shared contexts, MDIS and 
Resource-Site each authenticate with the DRB only once, but the DRB must authenticate 
with Clients each time. 
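
The effect of sharing can be made concrete by counting mutual authentications 
rather than time. The sketch below is an assumption-laden model, not a 
reproduction of the measurements: it assumes one DRB, one MDIS and one resource 
site, that every job arrives within the lifetime of the shared contexts, and that 
without sharing all three authentications are repeated for every job. The function 
name authentications is hypothetical. 

```python
def authentications(jobs: int, shared_contexts: bool) -> int:
    """Count mutual authentications seen by one DRB for `jobs` client requests."""
    client_auths = jobs               # client contexts are never shared
    if shared_contexts:
        backend_auths = 2             # MDIS and resource site: once each
    else:
        backend_auths = 2 * jobs      # MDIS and resource site: for every job
    return client_auths + backend_auths
```

Under these assumptions, 100 jobs need 300 authentications without shared contexts 
and 102 with them. 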

From the above figures, we can see that shared contexts substantially improve the 
efficiency of the security system. 

6 Related Work 

GSI is a component of the Globus Toolkit [3, 6] that has become the de facto standard 
for Grid security. GSI employs PKI [4], communicates over SSL (Secure Sockets 
Layer), provides mutual authentication and communication confidentiality using 
public key cryptography and digital certificates, and is extended to support single 
sign-on. However, its authorization, which relies on a gridmap file, is too 
coarse-grained. 

To summarize, although existing distributed security technologies can solve the 
problems faced in their own domains, they cannot 
adequately address the issues in our data grid environment. 



7 Conclusion 

In this paper, we present a new security mechanism implemented in the GridDaEn system: 
the shared context-based security mechanism. Its main contribution is that it 
offers security guarantees while meeting the stringent performance requirements of 
GridDaEn. Currently, we are preparing to release the next version of this security 
system and to integrate it fully into GridDaEn. 

Abstract: During the last decade, much research has been done to evolve 
cost-effective techniques for performance monitoring and analysis of applications 
running on conventional parallel systems. With the advent of technology, new 
Grid-based systems emerged by the end of the last decade, and received 
attention as useful environments for high-performance computing. The 
requirements for discovering new techniques for monitoring the Grid have also 
come to the front. The objective of this paper is to study the techniques used by 
tools targeting traditional parallel systems and highlight the requirements and 
techniques used for Grid monitoring. We present case studies of some 
representative tools to get an understanding of both paradigms. 



1. Introduction 

Computer scientists and researchers have long aimed to develop effective 
methods and tools for performance analysis of compute-intensive applications and 
to provide support for automatic parallelisation and performance tuning. In other 
words, performance engineering of such applications in a cost-effective manner has 
been a major focus of the researchers during the last decade. Many tools for 
measuring or predicting performance of serial / parallel programs have been 
developed. These tools were designed with diverse objectives, targeted different 
parallel architectures and adopted various techniques for collection and analysis of 
performance data. The scope of these tools comprises instrumentation, measurement 
(monitoring), data reduction and correlation, analysis and presentation and finally, in 
some cases, optimisation. 

The second half of the last decade however witnessed radical changes in technology. 
Clusters of workstations / personal computers / SMPs offered an attractive alternative 
to traditional supercomputers, as well as MPPs. With the advent of Internet 
technology, we then entered the era of Grid Computing. Introduction of this new 
paradigm in the world of high performance computing has forced computer 
scientists to look into the performance-engineering problem from an entirely different 
view. The term Grid indicates an execution environment in which high speed 
networks are used to connect supercomputers, clusters of workstations, databases, 
and scientific instruments located at geographically distributed sites. Monitoring such an 
environment for executing compute-intensive applications and performance 
engineering these applications is one of the central tasks of Grid computing research. 
Performance monitoring of Grid and Grid applications, by its very own nature, differs 
from the underlying principles of traditional performance analysis tools. The 
heterogeneous nature of computational resources, potentially unreliable networks 
connecting these resources and different administrative management domains pose 
the biggest challenge to the performance monitoring / engineering community in 
Computer Science research. 

The objective of this paper is to study the main trends of traditional performance 
analysis tools and Grid monitoring tools, to compare the techniques used by them, to 
focus on the major issues of Grid monitoring and applicability of various techniques 
for monitoring the applications and resources in Grid environment. Traditional 
performance analysis tools are discussed in Section 2 based on some case studies. The 
paper then summarises the properties of Grid and compares with conventional 
distributed systems. It also highlights the main issues that need to be tackled while 
designing a Grid monitoring system (Section 3). Case studies on representative Grid 
monitoring tools are furnished in Section 4 followed by a discussion on various 
aspects of Grid monitoring. Section 5 presents related work and Section 6 concludes. 



2. Performance Monitoring and Analysis for Traditional Parallel 
Systems 

This section focuses on the performance monitoring and analysis techniques adapted 
for use on traditional parallel systems. We present case studies on two performance 
analysis tools with an emphasis on their important features. 



2.1 SCALEA 

SCALEA is a performance analysis tool developed at the University of Vienna [17]. It 
provides support for automatic instrumentation of parallel programs, measurement, 
analysis and visualization of their performance data. The tool has been used to 
analyse and to guide the application development by selectively computing a variety 
of important performance metrics, by detecting performance bottlenecks, and by 
relating performance information back to the input program. 

The components of SCALEA include SCALEA Instrumentation System (SIS), 
SCALEA Runtime System, SCALEA Performance Data Repository, and SCALEA 
Performance Analysis and Visualization System. Each of these components can also 
be used as a separate tool. The SCALEA Instrumentation system (SIS) enables the user 
to select code regions of interest and automatically inserts monitoring code to collect 
all relevant performance information during an execution of the program. An 
execution of a program on a given target architecture is referred to as an experiment. 
Unlike most performance tools, SCALEA supports analysis of performance data 
obtained from individual experiments, as well as from multiple experiments. All the 
important information about performance experiments including source code, 
machine information and performance results are stored in a data repository. SCALEA 
focuses on four major categories of temporal overheads including data movement, 
synchronization, control of parallelism, and additional computation. The classification 
of overheads in SCALEA is hierarchical. 



2.2 Paradyn 

Paradyn, from the Computer Sciences Department of the University of Wisconsin, 
Madison, is a tool for measuring and understanding the performance of serial, parallel 
and distributed programs [8]. It consists of a data collection facility (the Data 
Manager), an automated bottleneck search tool (the Performance Consultant), a data 
visualization interface (the Visualization Manager), and a User Interface Manager. 
The central part of the tool is a multi-threaded process and communication between 
threads is defined by a set of interfaces constructed to allow any module to request a 
service of any other module. 

The Performance Consultant module of Paradyn automates the search for a 
predefined set of performance bottlenecks based on a well-defined model, called the 
search model. It attempts to answer three questions: first, why is the 
application performing poorly; second, where is the performance problem located; 
and finally, when does the problem occur. Paradyn uses dynamic instrumentation to 
instrument only those parts of the program relevant for finding the current 
performance problem. Dynamic instrumentation defers instrumenting the program 
until it is executed and dynamically inserts, alters and deletes instrumentation during 
program execution. Which data to collect is also decided during the execution of 
the program under the guidance of the Performance Consultant. 
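
The idea of inserting and deleting instrumentation while a program runs can be 
illustrated with a small analogy. Paradyn itself modifies the code of the running 
executable; the sketch below only mimics the principle by swapping a function 
attribute for a timing wrapper and restoring it later. The class Instrumenter and 
its methods are hypothetical and are not part of Paradyn. 

```python
import time


class Instrumenter:
    """Insert and delete timing probes on functions while the program runs."""

    def __init__(self):
        self.samples = {}      # function name -> list of elapsed seconds
        self._originals = {}   # function name -> (owner, original function)

    def insert(self, owner, name):
        original = getattr(owner, name)
        self._originals[name] = (owner, original)
        self.samples.setdefault(name, [])

        def probe(*args, **kwargs):
            start = time.perf_counter()
            try:
                return original(*args, **kwargs)
            finally:
                # Record the elapsed time even if the call raises.
                self.samples[name].append(time.perf_counter() - start)

        setattr(owner, name, probe)

    def delete(self, name):
        owner, original = self._originals.pop(name)
        setattr(owner, name, original)   # the program runs uninstrumented again
```

A search component could call insert on the functions it currently suspects and 
delete once they are ruled out, so that only the code under investigation pays the 
measurement overhead. 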



2.3 Discussion 

In this section, we summarise the salient features of performance analysis tools 
discussed above. Although these tools are only representatives of the vast number of 
performance analysis tools available for traditional parallel architectures, the 
following discussion will give some idea about the techniques adopted by the majority of 
them. 

The important components of performance analysis tools are (a) tools for 
instrumentation and data collection, (b) tools for analysis and identification of 
performance problems, and (c) tools for visualisation. 

Monitoring data is collected through source code instrumentation, and trace libraries are 
used for recording monitoring events. For example, SCALEA uses the SISPROFILING 
and PAPI (profiling using timers, counters, and hardware parameters) libraries. 
Source code instrumentation is generally guided by the user (in an interactive 
manner). Modern tools employ dynamic instrumentation techniques to dynamically 
change the focus, and performance bottlenecks are identified at run-time. 

The tools are designed to work on diverse hardware platforms. For example, the 
platform-dependent part of Paradyn is available for SPARC, x86 and PowerPC. 

One requirement of these tools is to support different programming paradigms. 
SCALEA, for example, supports performance analysis of HPF, OpenMP and MPI 
codes. 

Each tool defines its own data storage format, thus limiting the portability of 
monitoring data. 



3. High Performance Computing and Grids 

Grids are persistent environments that enable software applications to integrate 
computational and information resources, instruments and other types of resources 
managed by diverse organizations in widespread locations. Computational resources 
on a Grid together can solve very large problems requiring more resources than are 
available on a single machine. Thus, by harnessing the power of these distributed 
computational resources, "computational power grids" may be created for high 
performance computing. This concept is often referred to as a “Computational Grid”. 

Grids are usually viewed as the successors of distributed computing environments; 
although from the user's point of view conventional distributed systems and Grids are 
intrinsically semantically different. The resources in a Grid are typically 
heterogeneous in nature with fast internal message passing or shared memory [1]. A 
relatively slow wide area network externally connects the individual resources. 



3.1 Performance Analysis in Grid Environment 

Grid performance analysis is distinctively different from the performance analysis for 
traditional parallel architectures. The fundamental and ultimate goal of parallel 
processing is speeding up the computation, i.e. executing the same task in shorter 
time. In a Grid environment, however, speed of the computation is not the only issue. 
Mapping application processes to the resources in order to fulfil the requirement of 
the application in terms of power, capacity, quality and availability forms the basis of 
Grid performance evaluation. The huge amount of monitoring data generated in a 
Grid environment are used to perform fault detection, diagnosis and scheduling in 
addition to performance analysis, prediction and tuning [2]. The dynamic nature of 
Grid makes most parallel and distributed programming paradigms unsuitable and the 
existing parallel programming and performance analysis tools and techniques are not 
always appropriate for coping with such an environment. 

The important elements that make performance analysis for Grids different from 
the performance analysis for SMPs or MPPs may be summarised as below [10]: 

Lack of prior knowledge about the execution environment: The execution 
environment in a Grid is not known beforehand and also the environment may vary 
from one execution to another execution and even during the same execution of an 
application. 

Need for real-time performance analysis: Due to the dynamic nature of a Grid- 
based environment, offline analysis is often unsuitable. Sometimes it becomes 
necessary to predict the behaviour of an application based on some known 
characteristics of the environment. However, in the absence of precise knowledge 
about the future shape of the Grid, this prediction-based analysis is also of little 
use. 

Difficulty in performance tuning of applications: It may not be possible to 
exactly repeat an execution of a given application in a dynamic Grid environment. 
Therefore, performance tuning and optimisation of the application is difficult. 

Emergence of new performance problems: The very different nature of a Grid 
infrastructure gives rise to new performance problems that must be identified and 
tackled by the performance analysis tool. 

Requirement of new performance metrics: Usual performance metrics used by 
traditional performance analysis tools are inappropriate and therefore new 
performance metrics must be defined. 

Overheads due to Grid management services: Grid information services, 
resource management and security policies in a Grid environment add additional 
overhead to the execution of an application. 

The Global Grid Forum proposed a Grid Monitoring Service Architecture [6, 15] which 
is sufficiently general and may be adapted to a variety of computational environments 
including Grids, clusters and large compute farms. The essential components of the 
architecture as described in [6] are as follows: 

Sensors: for generation of time-stamped performance monitoring events for hosts, 
network processes and applications. Error conditions can also be monitored by the 
sensors. 

Sensor Manager: responsible for starting and stopping the sensors. 

Event Consumers: request data from sensors. 

Directory Service: for publication of the location of all sensors and other event 
suppliers. 

Event Suppliers or Producers: for keeping the sensor directory up-to-date and 
listening to data requests from event consumers. 

Event archive: for archiving the data that may be used later for historical analysis. 
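
How these components interact can be sketched in a few classes. This is an 
illustrative toy, not the Global Grid Forum specification: all class, method and 
field names are hypothetical, everything runs in one process, and the archive is a 
plain list. 

```python
import time


class Sensor:
    """Generates time-stamped events for one monitored entity."""

    def __init__(self, name, read):
        self.name, self._read, self.running = name, read, False

    def event(self):
        return {"sensor": self.name, "time": time.time(), "value": self._read()}


class SensorManager:
    """Starts and stops sensors."""

    def start(self, sensor):
        sensor.running = True

    def stop(self, sensor):
        sensor.running = False


class Directory:
    """Publishes which producer serves each sensor."""

    def __init__(self):
        self._where = {}

    def publish(self, sensor_name, producer):
        self._where[sensor_name] = producer

    def lookup(self, sensor_name):
        return self._where[sensor_name]


class Producer:
    """Keeps the directory up to date and answers consumer requests."""

    def __init__(self, directory, archive):
        self._directory, self._archive, self._sensors = directory, archive, {}

    def register(self, sensor):
        self._sensors[sensor.name] = sensor
        self._directory.publish(sensor.name, self)

    def request(self, sensor_name):
        sensor = self._sensors[sensor_name]
        if not sensor.running:
            raise RuntimeError(f"sensor {sensor_name} is stopped")
        event = sensor.event()
        self._archive.append(event)   # kept for later historical analysis
        return event


def consume(directory, sensor_name):
    """An event consumer: find the producer, then request the data."""
    return directory.lookup(sensor_name).request(sensor_name)
```

The consumer never needs to know where a sensor lives; it asks the directory for 
the responsible producer and requests the event from it, which is what decouples 
discovery from data transfer in this architecture. 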

In the next Section we present case studies on two Grid monitoring tools and 
highlight their important characteristics. 



4. Performance Monitoring and Analysis Tools for Grid 

Currently available Grid monitoring tools may be classified into two main 
categories: Grid infrastructure monitoring tools and Grid application monitoring tools. 
Although the primary focus of the majority of the tools is on monitoring and analysis of 
Grid infrastructure, many tools now take a unified approach. In 
the following sections, we present brief overviews of two such tools. It must be 
noted that there is no specific motive behind the selection of these two tools apart 
from conveying a basic idea of the requirements of Grid monitoring tools and 
their major components. A detailed survey of existing performance monitoring and 
evaluation tools may be found in [5], which aims to create a directory of 
tools that enables researchers to find relevant properties, similarities and 
differences of these tools. 



4.1 SCALEA-G 

SCALEA-G [18], developed at the University of Vienna, is the next generation of the 
SCALEA system. SCALEA-G is a unified system developed for monitoring Grid 
infrastructure based on the concept of the Grid Monitoring Architecture (GMA) [15] and 
at the same time for performance analysis of Grid applications [6]. SCALEA-G is 
implemented as a set of Grid services based on the Open Grid Services Architecture 
(OGSA) [4]. OGSA-compliant Grid services are deployed for online monitoring and 
performance analysis of a variety of computational resources, networks, and 
applications. 

The major services of SCALEA-G are: (a) a Directory Service for publishing and 
searching information about producers and consumers of performance data, (b) an 
Archival Service to store monitored data and performance results, and (c) a Sensor 
Manager Service to manage sensors. Sensors are used to collect monitoring data for 
performance analysis. Two types of sensors are used: system sensors (for monitoring 
Grid infrastructure) and application sensors (to measure the execution behaviour of 
Grid applications). XML schemas are used to describe each kind of monitoring data, 
and any client can access the data by submitting XPath/XQuery-based requests. The 
interactions between sensors and Sensor Manager Services also take place through the 
exchange of XML messages. 
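
To make the idea of XPath-based access to XML monitoring data concrete, the following minimal sketch queries a small XML document with the limited XPath subset of Python's standard library. The element and attribute names (monitoring, sensor, event, metric) are assumptions made for illustration only; they are not the actual SCALEA-G schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical monitoring data; the real SCALEA-G schemas are not shown in the text.
MONITORING_DATA = """
<monitoring>
  <sensor type="system" host="node01">
    <event metric="cpu_load" timestamp="1097000000">0.42</event>
    <event metric="free_mem_mb" timestamp="1097000000">512</event>
  </sensor>
  <sensor type="application" host="node02">
    <event metric="wallclock_s" timestamp="1097000005">3.7</event>
  </sensor>
</monitoring>
"""

root = ET.fromstring(MONITORING_DATA)


def query(xpath):
    """Run an XPath request and return (metric, value) pairs."""
    return [(e.get("metric"), float(e.text)) for e in root.findall(xpath)]


# All events produced by system sensors.
system_events = query(".//sensor[@type='system']/event")
# Only CPU load events, regardless of the sensor type.
cpu_events = query(".//event[@metric='cpu_load']")
print(system_events)
print(cpu_events)
```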

SCALEA-G supports both source code and dynamic instrumentation for profiling and 
monitoring events of Grid applications. The source code instrumentation service of 
SCALEA-G is based on the SCALEA Instrumentation System [17]. Dynamic 
instrumentation (based on Dyninst [3]) is accomplished by using a Mutator 
Service that controls the instrumentation of an application process on the host where the 
process is running. An XML-based Instrumentation Request Language (IRL) has also 
been developed for interaction with the client. 



4.2 GrADS and Autopilot 

Researchers from different universities, led by a team at Rice University, are working 
on the GrADS project [12], which aims at providing a framework for development and 
performance tuning of Grid applications. As a part of this project, a Program 
Execution Framework is being developed [7] to support resource allocation and 
reallocation based on performance monitoring of resources and applications. A 
performance model is used to predict the expected behaviour of an application, and 
the actual behaviour is captured during its execution from the analysis of measured 
performance data. If the expected behaviour and the observed behaviour disagree, 
actions such as replacement or alteration of resources or 
redistribution of the application tasks are performed. 
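
The comparison of expected and observed behaviour can be illustrated with a small sketch. It is not the GrADS implementation: the function names, the 20% tolerance and the rule for choosing an action are all assumptions made here for illustration.

```python
def check_contract(predicted, observed, tolerance=0.2):
    """Return tasks whose observed time exceeds the prediction by more than
    `tolerance` (a fraction), mapped to the observed/predicted ratio."""
    violations = {}
    for task, expected in predicted.items():
        actual = observed.get(task)
        if actual is None:
            continue  # no measurement yet for this task
        ratio = actual / expected
        if ratio > 1.0 + tolerance:
            violations[task] = ratio
    return violations


def choose_action(violations, total_tasks):
    """Hypothetical policy: local problems lead to task redistribution,
    widespread problems lead to resource reallocation."""
    if not violations:
        return "continue"
    if len(violations) * 2 > total_tasks:
        return "reallocate resources"
    return "redistribute tasks"


# Predicted and measured execution times in seconds (made-up numbers).
predicted = {"fft": 10.0, "solve": 20.0, "io": 5.0}
observed = {"fft": 10.5, "solve": 31.0, "io": 5.2}

violations = check_contract(predicted, observed)
action = choose_action(violations, len(predicted))
print(violations, action)
```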

The monitoring infrastructure of GrADS is based on Autopilot, which monitors and 
controls applications, as well as resources. The Autopilot toolkit supports adaptive 
resource management for dynamic applications. Autopilot integrates dynamic 
performance instrumentation and on-the-fly performance-data reduction with 
configurable, malleable resource management algorithms and a real-time adaptive 
control mechanism. Thus, it becomes possible to automatically choose and configure 
resource management policies based on application request patterns and system 
performance [13]. Autopilot provides a flexible set of performance sensors, decision 
procedures, and policy actuators to accomplish adaptive control of applications and 
resources. The policy implementation mechanism of the tool is based on fuzzy logic. 
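
The sensor, decision procedure and actuator chain with a fuzzy-logic policy can be sketched as follows. This is a deliberately small illustration and does not use the Autopilot API: the membership function, the two rules, the scale factors and the PoolActuator class are all hypothetical.

```python
def ramp_up(x, lo, hi):
    """Membership that rises linearly from 0 at `lo` to 1 at `hi`."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def decide_scale(load):
    """Decision procedure with two fuzzy rules:
    IF load is high THEN scale resources by 2.0;
    IF load is low  THEN scale resources by 0.5.
    The crisp output is the membership-weighted average of both rules."""
    high = ramp_up(load, 0.3, 0.9)
    low = 1.0 - high
    return high * 2.0 + low * 0.5


class PoolActuator:
    """Policy actuator that resizes a (hypothetical) worker pool."""
    def __init__(self, size):
        self.size = size

    def apply(self, scale):
        self.size = max(1, round(self.size * scale))
        return self.size


actuator = PoolActuator(size=8)
for load in (0.2, 0.6, 0.95):  # values a load sensor might report
    print(load, actuator.apply(decide_scale(load)))
```

Because the rule outputs are blended rather than switched at a hard threshold, the pool size changes gradually as the measured load moves between the low and high regions.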



4.3 Discussion 

This section described some representative tools for Grid monitoring and performance 
tuning of Grid applications. It is evident that on a large, complex and widely 
distributed system such as the Grid, post-mortem performance analysis is of no use. 
Thus, all the tools perform real-time analysis and some use the analysis data for 
automatic performance tuning of applications. The essential ingredients of real-time 
analysis are 

dynamic instrumentation and data collection mechanism (as performance problems 
must be identified during run-time), 

data reduction (as the amount of monitoring data is large and movement of this 
amount of data through the network is undesirable), 

low-cost data capturing mechanism (as overhead due to instrumentation and 
profiling may affect the application performance), and 

adaptability to heterogeneous environment (because of the heterogeneous nature of 
the Grid, the components along with the communicating messages and data must have 
minimum dependence on the hardware platform, operating system and 
programming languages and paradigms). 
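
As an illustration of the data reduction ingredient, the sketch below summarises raw samples per window at the monitored host, so that only a few numbers per window need to cross the network. The window size and the chosen summary statistics are assumptions made for this example, not a scheme prescribed by any of the tools discussed.

```python
def reduce_window(samples):
    """Summarise one window of raw samples."""
    return {
        "count": len(samples),
        "mean": sum(samples) / len(samples),
        "min": min(samples),
        "max": max(samples),
    }


def reduce_stream(samples, window):
    """Split a list of raw samples into fixed-size windows and summarise each."""
    return [
        reduce_window(samples[i:i + window])
        for i in range(0, len(samples), window)
    ]


raw = [1.0, 3.0, 2.0, 6.0, 4.0, 5.0, 9.0]  # made-up measurements
summaries = reduce_stream(raw, window=3)
for s in summaries:
    print(s)
```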

The basis of most monitoring tools is the general Grid Monitoring Architecture 
proposed by the Global Grid Forum. However, when the tools extend their 
functionality to incorporate real-time performance analysis and tuning, they have to 
confront complex issues like fault diagnosis and application remapping. 

Many of the already existing libraries and performance monitoring tools for 
traditional parallel systems form parts of the Grid monitoring tools. For example, 
SCALEA-G uses the instrumentation system of SCALEA [17] and Dyninst [3], and 
one of the current foci of the GrADS project is to incorporate the instrumentation of 
SvPablo [14]. 



5. Related Work 

Performance tools for conventional parallel systems have been in use for quite some 
time. Therefore, many researchers have already studied, discussed and compared these 
tools. Some of the recent studies focus on surveying Grid monitoring tools [2, 5]. 
In particular, [5] provides a directory of recently available tools along with a 
comparative analysis of them. 

The objective of this paper is not to prepare a survey report, nor does it aim at the 
comparative analysis of performance tools. The main intention is to study the 
requirements for Grid monitoring tools in comparison to the standard performance 
monitoring and analysis techniques applicable to conventional parallel architectures. 



6. Conclusion 

This paper discussed and highlighted different monitoring techniques for traditional 
parallel architectures and Grids. The observations made in this paper build an 
understanding of the recent road-map of performance engineering tools for 
developing high performance applications targeting modern systems. In addition, a 
direction for new generation tools is also set out. 



Abstract. In this paper, we present a Workflow-based grid portal for 
problem Solving Environment (WISE), which has been developed by integrating 
workflow, Grid and web technology to provide an enhanced, powerful 
approach for problem solving environments. Workflow technology 
supports coordinated execution of multiple application tasks on Grid 
resources by enabling users to describe a workflow by composing many 
existing applications and new functions, and provides an easy, powerful 
tool to create new Grid applications. We propose a new Grid portal that allows 
us to use Grid resources with improved workflow patterns to represent 
the various parallelisms inherent in parallel and distributed Grid applications, 
and present the Grid Workflow Description Language (GWDL) to specify our 
new workflow patterns. Also, we show that the MVC (Model View 
Control) design pattern and multi-layer architecture provide modularity 
and extensibility to WISE by separating the application engine control 
and presentation from the application logic for Grid services, and the 
Grid portal service from the Grid service interface. 



1 Introduction 

For the success of Grid computing, an easy, powerful PSE, which provides 
computing resources and high quality apparatus to solve complex problems of science 
and engineering technology, is needed. In Internet and distributed computing, 
there are useful technologies for PSE, which have evolved in parallel. Web 
technology has emerged with revolutionary effects on how we access and process 
information. Grid computing enables us to use a large or nationwide network of 
resources as a single unified computing resource [1]. So, clear steps must be taken 
to integrate Grid and Web technologies to develop an enhanced, powerful tool for 
PSE. Workflow technology is very useful because it enables us to describe a 
process of work by composing multiple application tasks on multiple distributed 
Grid resources. It allows users to easily develop new applications by composing 
services and expressing their interaction. Several studies of workflow management 
systems for Grid computing have been published [11,12,15,13,16,14,17,18,19]. 
However, their workflow models offer only simple workflow patterns, such as 
sequence, parallel constructs like AND-split/merge, conditional constructs like 
XOR-split, and iteration constructs, which are too simple to implement parallel 
and distributed Grid applications efficiently. 

* This work has been supported by a Korea University Grant, KIPA-Information 
Technology Research Center, University research program by Ministry of Information & 
Communication, and Brain Korea 21 projects in 2004. 

In this paper, we present a Workflow-based grid portal for problem Solving 
Environment (WISE), which has been developed by integrating workflow, Grid 
and web technology to provide an enhanced, powerful approach for problem 
solving environments. Workflow technology supports coordinated execution of 
multiple application tasks on Grid resources by enabling users to describe a workflow 
by composing many existing applications or new functions, and provides an easy, 
powerful tool to create new Grid applications. We show our advanced workflow 
patterns and description language. Our new Grid portal allows us to use Grid 
resources with improved workflow patterns to represent the various parallelisms 
inherent in parallel and distributed Grid applications. Also, we are concerned 
with the design and implementation of a Grid portal architecture enhanced with 
software that allows users to transparently access remote heterogeneous resources 
through Grid services. We show that the MVC (Model View Control) 
design pattern and multi-layer architecture provide modularity and extensibility 
by separating the application engine control and presentation from the 
application logic, and the Grid portal service from the Grid service interface. 

The outline of our paper is as follows. In section 2, we describe the basic 
concepts of Grid and workflow technology, together with related work. 
In section 3, we present our new workflow patterns and workflow description 
language. In section 4, we illustrate the architecture of WISE and describe its 
services in detail. In section 5, we explain the implementation of WISE. In section 
6, we give our conclusions. 



2 Related Work 

2.1 Grid User, PSE, and Grid Portal 

A Grid user does not want to be bothered with details of the underlying 
infrastructure but is really only interested in the execution of applications and the 
acquisition of correct results in a timely fashion. Therefore, a Grid environment 
should provide access to the available resources in a seamless manner such that the 
differences between platforms, network protocols, and administrative boundaries become 
completely transparent, thus providing one virtual homogeneous environment. 
The Grid requires several design features: a wide range of services on heterogeneous 
systems, an information-rich environment on a dynamic Grid, single sign-on, and use 
of standards and existing applications. The Globus toolkit establishes a software 
framework for common services of Grid infrastructure by providing a meta computer 
toolkit such as Meta Directory Service, Globus Security Infrastructure, 
and Resource Allocation Manager [2, 3]. However, it is the responsibility of 
application users to devise methods and approaches for utilizing Grid services. 

A PSE is a useful tool for solving problems from a specific domain. Traditionally, 
PSEs were developed as client-side tools. Recently, web-based Grid portals have 
been developed to launch and manage jobs on the Grid via the Web, and allow 
users to program and execute distributed Grid applications from a conventional 
Web browser. WebFlow is a pioneering computing web portal work, where an HTTP 
server is used as a computing server proxy using CGI technology [4]. GridPort [5] 
allows developers to connect Web-based interfaces with the computational Grid 
behind the scenes through the Globus Toolkit [2] and Web technologies such as CGI 
and Perl. The HotPage user portal is designed to be a single point of access to all 
Grid resources with informational and interactive services by using GridPort. 
The Astrophysics Simulation Collaboratory (ASC) portal is designed for the study 
of physically complex astrophysical phenomena [6]. It uses Globus and Cactus as 
core computational tools. 

2.2 Workflow Patterns 

In Grid computing, a workflow is a process that consists of activities and the 
interactions between them. An activity is a basic unit of work: a Grid service or an 
application executed on the Grid. In [11,12,13], the existing workflow patterns are 
introduced for describing control flow only. Most of them are basic patterns 
such as sequence, simple parallel constructs like AND-Split/Join, conditional 
constructs like XOR-Split/Join and OR-Split/Join, and iteration constructs. Other 
patterns such as N out of M join, deferred choice, and arbitrary cycle are presented 
in [11,15]. These are insufficient to express parallel applications. Triana [13,14] 
presents link elements for data flow and simple control flow patterns such as a 
pair of AND-Split and AND-Join, a pair of IF-Split and IF-Join, and Count 
Loop and While Loop. The Grid Services Flow Language (GSFL) [16] is an 
XML-based language that allows the specification of workflow descriptions in 
the OGSA framework. It also uses simple link elements. GridAnt [18] is a client-side 
workflow tool based on the Java tool Apache Ant. It describes parallelism by 
specifying dependencies between tasks. The myGrid workflow [19] provides a graphical 
tool and workflow enactor, but in terms of parallel control flow, it supports 
only simple parallelism like the above workflow models and languages. 

3 Grid Workflow Description 

3.1 New Advanced Workflow Patterns 

We need new advanced workflow patterns for Grid applications to describe various 
kinds of parallelism such as pipeline, data parallelism, and many synchronizing 
constructs, and to prevent incomprehensible workflow descriptions. A complex workflow 
can be built from simple link patterns, but it is difficult. Moreover, any control 
flow produced by composing sequences and arbitrary cycles may generate ambiguity, 
that is, a state too complex and difficult for its correct meaning to be 
comprehended. Therefore, to describe a precise workflow easily and quickly, 
structured patterns with clear context are more efficient than the non-structured ones 
made by arbitrary compositions of sequences and arbitrary cycles. The details of our 
workflow model are published in [20]. 

Fig. 1. Advanced basic control patterns: (a) Sequence (b) XOR-Split (c) XOR-Join 
(d) Loop (e) Multi-Split (f) Multi-Join (g) AND-Split (h) AND-Join (i) 
AND-Loop symbol (i') AND-Loop description (j) Queue (k) Wait (l) Node copy 

In figure 1 we show our basic patterns in three groups. In sequential flow, 
there are four types: Sequence for sequential control flow, XOR-Split for conditional 
execution, XOR-Join for conditional selection among many executed activities, and 
Loop for repetition. In mixed flow, there are two patterns: Multi-Split for multiple 
conditional execution, and Multi-Join for multiple conditional selection. Parallel flow 
includes AND-Split for parallel execution, AND-Join for blocked synchronization, 
AND-Loop, Queue, Wait, and Node copy. Whenever an iteration of AND-Loop is 
complete, two control flows are emitted in two directions, as in figure 1 (i'): one for 
repetition and the other for the next sequential activities. The circular arrow in 
figure 1 (i) is the graphic notation of AND-Loop. AND-Loop can send many flows to a 
next node N continuously. If N is a bottleneck, activities that send flows to N may 
stop or be processed slowly. A pattern is needed to prevent this situation. In the 
queue pattern, all input control flows are stored in a queue and transferred whenever 
the next node is idle. In the node copy pattern, a node is copied up to a limited 
number of times. An idle copy is selected and executed in parallel whenever an input 
control flow occurs. This pattern can increase computing power and may solve the 
above bottleneck problem. In the wait pattern, the wait node blocks until some or all 
input control flows are received. For 
example, in figure 1 (k) wait node blocks until all control flows generated by 
AND-Loop Til and node rii are received. The symbol means all. 
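
The queue, node copy and wait patterns described above can be approximated with threads, as in the following minimal sketch. It is only an analogy under our own assumptions: control flows are modelled as plain Python values, node copies as worker threads, and the function name run_node_copies is hypothetical.

```python
import queue
import threading


def run_node_copies(activity, inputs, copies):
    """Queue pattern: buffer incoming flows until a node copy is idle.
    Node copy pattern: run the activity on up to `copies` replicas in parallel.
    Wait pattern: block until every buffered flow has been processed."""
    pending = queue.Queue()
    for item in inputs:
        pending.put(item)

    results = []
    lock = threading.Lock()

    def replica():
        while True:
            try:
                item = pending.get_nowait()  # an idle copy takes the next flow
            except queue.Empty:
                return  # nothing left for this copy to do
            output = activity(item)
            with lock:
                results.append(output)

    workers = [threading.Thread(target=replica) for _ in range(copies)]
    for worker in workers:
        worker.start()
    for worker in workers:  # wait for all copies to finish
        worker.join()
    return results


squares = run_node_copies(lambda x: x * x, range(5), copies=3)
print(sorted(squares))
```

The order of the results depends on thread scheduling, which is why the example sorts them before printing.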



3.2 Grid Workflow Description Language 

We define the Grid Workflow Description Language (GWDL), which is an XML-based 
language that specifies our workflow model using XML Schemas. The 
GWDL architecture consists of dataDefine, resourceDefine, activityDefine, and 
flowDefine elements. DataDefine lists user data types that are used to describe 
input/output data of an activity node. ResourceDefine describes the host address 
or resource specification for executing an activity. ActivityDefine lists activities, 
which are executions of executable files or requests to running services on 
Grid resources. It describes the function, names and types of input/output data. 
FlowDefine defines activity node elements, which have an activity and a resource 
specification, and both the control and data flows, which describe the interaction 
of activity nodes. It has basic control flow elements for representing our 
new workflow patterns in section 3.1. We also define three elements for control 
flow: <sequence>, <parallel>, and <loop>. Activity nodes in a sequence element 
are connected sequentially, and those in a parallel element are executed 
concurrently. The loop element iterates the sub-workflow. It has a 'pipeline' attribute 
which indicates AND-Loop. Also, in flowDefine elements, there are data elements 
for describing data flow, each of which has a source, destination, and variable name. 

Fig. 2. The Overall Architecture of WISE 
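
A document with the GWDL structure described above might be assembled as in the following sketch. Only the element names dataDefine, resourceDefine, activityDefine, flowDefine, sequence, parallel, loop and the 'pipeline' attribute come from the description; the root element name, the child element names (type, resource, activity, node, data) and all attribute names and values are assumptions, since the GWDL schema itself is not reproduced here.

```python
import xml.etree.ElementTree as ET

gwdl = ET.Element("gwdl")  # hypothetical root element

# User data types used by activity inputs and outputs.
data_def = ET.SubElement(gwdl, "dataDefine")
ET.SubElement(data_def, "type", name="block")

# Hosts or resource specifications for executing activities.
res_def = ET.SubElement(gwdl, "resourceDefine")
ET.SubElement(res_def, "resource", name="r1", host="node01.example.org")

# Activities: executable files or requests to running services.
act_def = ET.SubElement(gwdl, "activityDefine")
for name in ("split", "filter", "merge"):
    ET.SubElement(act_def, "activity", name=name, executable="/bin/" + name)

# Control flow: split, then a pipelined loop (AND-Loop) over filter, then merge.
flow = ET.SubElement(gwdl, "flowDefine")
seq = ET.SubElement(flow, "sequence")
ET.SubElement(seq, "node", activity="split", resource="r1")
loop = ET.SubElement(seq, "loop", pipeline="true")
ET.SubElement(loop, "node", activity="filter", resource="r1")
ET.SubElement(seq, "node", activity="merge", resource="r1")

# Data flow: source, destination and variable name.
ET.SubElement(flow, "data", source="split", destination="filter", variable="block")

text = ET.tostring(gwdl, encoding="unicode")
print(text)
```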



4 WISE Architecture 

4.1 Overall Architecture 

Our Grid portal has a 3-tier architecture which consists of clients, a web 
application server, and a network of computing resources, as shown in figure 2. The web 
application server in the middle tier is augmented with Grid-enabling software to 
provide access to Grid services and resources. It is designed as a multilayered 
structure which exploits the MVC (Model-View-Controller) design pattern and 
CoG (Commodity Grid) technology to construct a highly modular and flexible 
software environment. The application engine in the web server controls and organizes 
the overall portal by executing the proper application logic component according 
to the client request, which in turn carries out Grid services through the Grid 
service interface, and then activating the presentation module, which generates 
the application-specific display to be transmitted to the client's browser. The MVC 
design pattern provides modularity and extensibility of the Grid portal. Grid service 
interface components are implemented by using CoG technology, which maps 
Grid functionality into a commodity tool. 
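
The division of labour between the application engine, the application logic, the presentation module and the Grid service interface can be sketched as follows. WISE itself is implemented with Java servlets and JSP; this Python sketch only mirrors the control flow, and every class, function and data value in it is hypothetical.

```python
class GridServiceInterface:
    """Stand-in for the CoG-based Grid service interface; returns canned data."""
    def query_resources(self):
        return [{"host": "node01", "cpus": 4}, {"host": "node02", "cpus": 8}]


def list_resources_logic(grid, params):
    """Application logic: calls Grid services and returns plain data."""
    min_cpus = int(params.get("min_cpus", 1))
    return [r for r in grid.query_resources() if r["cpus"] >= min_cpus]


def list_resources_view(model):
    """Presentation: turns the data into the display sent to the browser."""
    return "\n".join(f"{r['host']}: {r['cpus']} CPUs" for r in model)


class ApplicationEngine:
    """Controller: picks the logic and presentation for each client request."""
    def __init__(self, grid):
        self.grid = grid
        self.routes = {}

    def register(self, action, logic, view):
        self.routes[action] = (logic, view)

    def handle(self, action, params):
        logic, view = self.routes[action]
        return view(logic(self.grid, params))


engine = ApplicationEngine(GridServiceInterface())
engine.register("list_resources", list_resources_logic, list_resources_view)
page = engine.handle("list_resources", {"min_cpus": "8"})
print(page)
```

Because the logic never formats output and the view never touches Grid services, either side can be replaced without changing the other, which is the modularity argument made above.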



4.2 WISE User Interfaces 

Our workflow-based Grid portal provides the following functions, which allow users 
to easily access Grid resources and build new Grid applications efficiently. 

User Authentication and Profile: These are basic components of a Grid portal. 
A user is authenticated only once and is then provided with all functions of our portal. 
WISE has a user profile function which manages the user's information, such as Grid 
certificate creation, update and management, the list of available Grid resources, 
management of environment variables on many distributed hosts, requests for 
resource authorization to the resource manager, and email address. 

Remote File Browser: This provides file management functions such as directory 
browsing on remote hosts, creation, deletion, editing, upload from the local file system, 
download to the local file system, file transfer between two remote hosts by GridFTP, 
and editing/saving of remote files. With this basic and powerful function for remote 
file management, a user can modify and move remote files, configuration/parameter 
files of applications, and so on. 

Graphic Workflow Editor: This editor is for creating a GWDL file. A user can 
describe a workflow with the graphical tool or a text-based XML editor. First, activities 
and their input/output data are defined. Second, the user describes the interaction 
between them by composing them into a workflow. The user can specify the host to 
execute an activity, or leave the choice of host to the workflow execution function. 

Our workflow system supports not only the execution of executable files but also 
WISE-activities, which interact tightly with the workflow supervisor and can send data 
to another WISE-activity through a socket. The workflow supervisor monitors the state 
of a WISE-activity and controls its execution and data transfer through the socket. 
A user can program and compile WISE-activity code with this workflow editor after 
defining its name and input/output data in GWDL. Users will use WISE-activities 
to implement new Grid applications, or monitoring programs that efficiently report the 
input, output, and state of an executed legacy application. 

Workflow Execution: After editing a GWDL file, we run it with this function, 
which shows the state of the executed workflow graphically. The execution of an 
activity, the state of input/output data transfer, the values of some variables in the 
workflow, and the standard output/error of a running activity are displayed. If the 
user specifies the requirements for the host to run an activity, the workflow supervisor 
selects an idle host by using the resource mapping portal service. 

Data Viewer: A user can select a data file through the file browser and see it with 
a data viewer which is associated with the data type. Users can add a new data 
viewer, implemented in Java or Microsoft COM, to our Grid portal. 

Resource Information: The Resource Information function enables us to find data 
and status about Grid resources such as CPU, memory, file system, network, OS, 
and software through MDS. 

Simple Job Submission: A user can send a request to run a single executable file 
on a remote host and see its standard output as text. 

4.3 Grid Portal Service 

The Grid portal services form the foundation of WISE, and are used directly 
by the application logic and presentation activated by the application engine. 
Each application logic and presentation component has a one-to-one correspondence 
with a user request from the GUI on the client side for solving a specific application 
problem on remote resources. In this subsection, we describe how various Grid portal 
services complete their missions by executing Grid services through the Grid 
service interface components. 

Single Job Submission: The job submission service can be executed in two modes. 
In user mode, the user prepares a job specification using the RSL component, and 
the job submission service executes the jobs on the remote resources as specified in 
RSL by using the GRAM component. In automatic mode, the job submission service 
obtains computing resources from the resource mapping service. 
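
The two submission modes can be illustrated with a small sketch that builds a simplified RSL conjunction string and picks a target host. It does not call Globus or the CoG Kit: to_rsl and submit are hypothetical helpers, the host names are made up, and only a small part of the RSL syntax (an ampersand followed by parenthesised attribute=value relations) is covered, without quote escaping.

```python
def to_rsl(executable, arguments=(), count=1, **attributes):
    """Build a simplified RSL conjunction string for a single job."""
    def quote(value):
        text = str(value)
        if '"' in text:
            raise ValueError("quote escaping is not handled in this sketch")
        return f'"{text}"'

    relations = [f"(executable={quote(executable)})"]
    if arguments:
        relations.append("(arguments=" + " ".join(quote(a) for a in arguments) + ")")
    relations.append(f"(count={int(count)})")
    for name, value in attributes.items():
        relations.append(f"({name}={quote(value)})")
    return "&" + "".join(relations)


def submit(job_rsl, host=None, find_idle_host=lambda: "node03.example.org"):
    """User mode: the caller names the host. Automatic mode: the host comes
    from a resource mapping function (here a stub returning a fixed name)."""
    target = host if host is not None else find_idle_host()
    return target, job_rsl  # a real service would pass these on to GRAM


rsl = to_rsl("/bin/echo", ["hello", "grid"], count=2, maxMemory=512)
print(submit(rsl, host="node01.example.org"))  # user mode
print(submit(rsl))                             # automatic mode
```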

Information Service: The information service provides static and dynamic 
information on resources by querying and formatting results obtained from the MDS 
component in the Grid service interface layer. It supports querying the MDS for 
hardware information such as CPU type, number of CPUs, CPU load and queue 
information that can be used by the user to make more efficient job scheduling 
decisions. 

Data Transfer Service: Data transfer provides file transfer capabilities be- 
tween client and target machine, or between third-party resources as well as file 
browsing to facilitate the transfer of programs or data files for remote execution 
and data retrieval by using GridFTP in Grid service interface layer. 

Security: Security service simplifies the authentication task by exploiting GSI 
of Globus which provides a secure method of accessing remote resources by 
enabling a secure, single sign-on capability, while preserving site control over 
access control policies and local security infrastructure. 

User Profile Service: The user profile allows users to keep track of past jobs 
submitted and results obtained by maintaining the user history, and in addition 
enables the customization of a particular set of resources by users to provide 
additional application-specific information for a particular class of users. Sending 
email and management of user certificates are also provided. 

Workflow Supervisor: The workflow supervisor consists of three parts: parser, 
scheduler, and dispatcher. First, a GWDL file is passed to the workflow parser. 
The parser parses it and sends the activity node information to the scheduler. 
Second, the dispatcher acquires data about the executable files of activities and 
distributes the executables to Grid resources. Third, the scheduler runs the input 
GWDL file with the activity node information. It sends requests for activity execution 
to the activity manager, monitors the information of activities, and controls activities. 
If no host name is given for an activity node, it gets a host from the resource mapping 
service. 

Activity Manager: The activity manager controls the execution of activities with 
the Globus GRAM service, acquires the state of activities, and manages input/output 
data of activities with the GridFTP service. It also controls and monitors 
WISE-activities directly through a socket for state acquisition and data transfer. 
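
The cooperation of the workflow supervisor and the activity manager can be sketched as follows. The sketch parses a small flow description, walks its sequence and parallel elements, and hands each activity node to a stub activity manager. The XML attribute names, the ActivityManager stub and the map_resource function are assumptions for illustration; the dispatcher step and real GRAM or GridFTP calls are left out.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical flow description; attribute names are not taken from the GWDL schema.
FLOW = """
<flowDefine>
  <sequence>
    <node activity="prepare" host="node01"/>
    <parallel>
      <node activity="simulate_a" host="node02"/>
      <node activity="simulate_b"/>
    </parallel>
    <node activity="collect" host="node01"/>
  </sequence>
</flowDefine>
"""


class ActivityManager:
    """Stub; a real activity manager would use GRAM and GridFTP."""
    def __init__(self):
        self.log = []

    def execute(self, activity, host):
        self.log.append((activity, host))
        return f"{activity}@{host}"


def map_resource():
    """Stand-in for the resource mapping service."""
    return "node03"


def run(element, manager):
    """Scheduler: walk the parsed flow and execute its activity nodes."""
    if element.tag == "node":
        host = element.get("host") or map_resource()
        return [manager.execute(element.get("activity"), host)]
    if element.tag == "parallel":
        # Children of a parallel element run concurrently.
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(lambda child: run(child, manager), element))
    else:
        # A sequence element (or the flowDefine root) runs children in order.
        parts = [run(child, manager) for child in element]
    return [result for part in parts for result in part]


manager = ActivityManager()
results = run(ET.fromstring(FLOW), manager)  # the parser step is ET.fromstring
print(results)
```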

WISE-Activity Editing/Compile: First, the user describes the name and 
input/output data of a WISE-activity in a GWDL file, and then the workflow editor 
generates the template code for the WISE-activity from the content of the GWDL file. 
Second, the user can program the logic of the WISE-activity and compile the code. This 
service provides two functions: generating template code and compiling user code. 

Resource Mapping: This service finds computing resources to which a user 
can submit jobs by using the MDS component, and returns idle resources according 
to the user requirements such as memory size, CPU performance, etc. 
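
A resource mapping step of this kind can be sketched as a filter over resource records. The record fields (host, idle, memory_mb, cpu_mhz), the sample data and the preference for faster CPUs are assumptions made for this example; in WISE the information would come from MDS.

```python
def map_resources(resources, min_memory_mb=0, min_cpu_mhz=0):
    """Return idle resources that satisfy the requirements, fastest CPU first."""
    matches = [
        r for r in resources
        if r["idle"]
        and r["memory_mb"] >= min_memory_mb
        and r["cpu_mhz"] >= min_cpu_mhz
    ]
    return sorted(matches, key=lambda r: r["cpu_mhz"], reverse=True)


# Made-up resource records standing in for MDS query results.
RESOURCES = [
    {"host": "node01", "idle": True, "memory_mb": 512, "cpu_mhz": 1800},
    {"host": "node02", "idle": False, "memory_mb": 2048, "cpu_mhz": 3000},
    {"host": "node03", "idle": True, "memory_mb": 1024, "cpu_mhz": 2400},
]

candidates = map_resources(RESOURCES, min_memory_mb=512)
print([r["host"] for r in candidates])
```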



4.4 Grid Service Interface 

The Grid service interface defines classes that provide access to basic Grid services 
and enhanced services suitable for PSE, and encapsulates the functionality of the 
Grid services offered in the Globus toolkit by using CoG. 

The RSL component provides methods for specifying resource requirements in 
Resource Specification Language (RSL) expressions. The GRAM component provides 
methods which allow users to submit jobs on remote machines as specified in 
RSL, monitor job status and cancel jobs. The GridFTP [8] component provides file 
transfer on the Grid. The GASS component supports access to remote file systems 
on the Grid. The MDS component allows users to easily access the MDS service in Globus. 
MyProxy is a secure online certificate repository that allows users to store 
certificates and retrieve them at a later time. A user can acquire a Globus proxy 
with an ID and password from the MyProxy server. 

5 Implementation 

WISE deploys the Apache web server, a free, open source web server that supports 
SSL. WISE was developed under the open source Tomcat Java servlet container, 
which is freely and widely available, and implemented by using the Java language, JSP, 
Java Beans, and Java servlets. Grid service interfaces are implemented in Java 
packages that provide the interface to the low level Grid services by using the Java 
CoG Kit [10]. WISE provides users with various interactive operations: login/out, 
job submission, MDS query, GridFTP, file browsing, user profile, workflow editing 
and running. They support transparent access to heterogeneous resources by 
providing a unified and consistent window to the Grid, allowing users to allocate 
the target resources needed to solve a specific problem, edit and transfer programs 
or data files to the target machines for remote execution and data retrieval, 
select and submit the application jobs automatically, and query the information on 
the dynamic state of hardware and software on the Grid for more efficient job 
scheduling decisions, as well as the private user profile. In addition, WISE 
provides an easy-to-use user interface for using a workflow-based parallel programming 
environment on the Grid, by supporting a graphical workflow editor, resource finding, 
authentication, execution, monitoring, and steering. Therefore, the Grid portal 
provides a transparent mapping between the user interface and remote resources, hiding 
the complexity of the heterogeneous underlying backend systems. 

6 Conclusion 

Commodity distributed computing technology on the Web enables the rapid 
construction of sophisticated client-server applications, while Grid technology 
provides advanced network services for large-scale, wide area, multi-institutional 
environments and applications that require the coordinated use of multiple 
resources. In this paper, we present a web-based Grid portal which bridges these two 
worlds on the Internet in order to enable the development of advanced applications 
that can benefit from both Grid services and commodity web environments. We also 
provide new workflow patterns and GWDL, which can overcome the limitations 
of the previous approaches by providing several powerful workflow patterns 
that efficiently represent the parallelisms inherent in parallel and distributed 
applications. To describe a workflow of a Grid application without ambiguity, we have 
formally proposed new advanced basic patterns such as AND-Loop, queue, wait, 
and node copy by classifying them into three categories: sequential, parallel, 
and mixed flow. Our workflow-based Grid portal has been designed to provide 
a powerful problem solving environment by supporting a unified and consistent 
window to the Grid, which enables a substantial increase in users' ability to solve 
problems that depend on the use of large-scale heterogeneous resources. It provides users 
with a uniform and easy to use GUI for various interactive operations for PSE 
such as login/out, job submission, information search, file browsing, file transfer, 
and user profile, and especially supports interfaces for using a workflow-based 
parallel programming environment on the Grid, by supporting a graphical workflow 
editor, resource finding, authentication, execution, monitoring, and steering. We 
have proposed a multi-layer architecture which provides modularity and 
extensibility because the layers interact with each other through uniform interfaces. 
Also, we have shown that MVG design pattern provides flexibility and modular- 
ity by separating the application engine control and presentation from the appli- 
cation logic for Grid services, and that commodity-to-Grid technology for Grid 
service interface supports various platforms and environments by mapping Grid 
functionality into a commodity distributed computing components. As a future 
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work, we are extending our Grid portal which enables the automatic conversion 

of a given problem into optimized parallel programming models by supporting 

more specific coarse-grained parallel programming models in our system. 
Abstract. Because transactions in Grid applications often have deadlines,
effectively processing real-time transactions in Grid services is a
challenging task. Although real-time transaction techniques have been well
studied in databases, they cannot be directly applied to Grid applications
because of the characteristics of Grid services. In this paper*, we propose
an effective model and corresponding coordination algorithms to handle
real-time transactions for Grid services. The model can intelligently discover
the Grid services required to process specified sub-transactions at runtime,
and invoke the algorithms to coordinate these services so that the
transactional and real-time requirements are satisfied without involving users
in the complex process. We use a Petri net to validate the model and
algorithms.



1 Introduction 

One objective of developing a service Grid is to provide users with transparent 
services and hide the complex process from them. Because many Grid
applications have time restrictions and transactional requirements, the
technology for processing real-time transactions is key to whether the service
Grid can be widely accepted in commercial use. The real-time transaction for Grid
services [1] differs from conventional real-time database transactions because (a) 
Grid services are loosely coupled, and (b) Grid services can dynamically join and 
leave the Grid. Therefore, it is important to investigate the real-time transaction 
technology in the Grid service environment. 

The deadline of a real-time transaction specifies the time by which the trans- 
action must complete or else undesirable results may occur. Based on the strict- 
ness of deadlines, real-time transactions for Grid services can be classified into 
three types, similar to those in the traditional distributed systems [2]. 

- Hard real-time transaction. This is the strictest kind of real-time
  transaction. If such a transaction misses its deadline, the consequences
  are catastrophic.

- Firm real-time transaction. Completing a firm real-time transaction after
  its deadline is of no value, but no catastrophic results occur if it misses
  its deadline.

- Soft real-time transaction. Meeting the deadline is the primary performance
  goal. Unlike a firm real-time transaction, however, there is still some
  benefit in completing a soft real-time transaction after its deadline.

* This paper is supported by the 973 project of China (No. 2002CB312002) and a
grand project of the Science and Technology Commission of Shanghai
Municipality (No. 03dzl5027).

In this paper, we focus on soft and firm real-time transactions. Our
motivation is to provide a model whose key component is the real-time
transaction service (GridRTS), so that application programmers can use
GridRTS to build real-time applications easily.
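
The three deadline classes above can be read as different value functions over
the completion time. The following Python sketch is only an illustration under
stated assumptions: the names DeadlineKind and completion_value, the unit full
value, and the linear decay for soft deadlines are hypothetical and are not
part of GridRTS.

```python
from enum import Enum


class DeadlineKind(Enum):
    HARD = "hard"
    FIRM = "firm"
    SOFT = "soft"


def completion_value(kind, finish_time, deadline, full_value=1.0, decay=0.1):
    """Value gained when a transaction finishes at finish_time.

    full_value and decay are illustrative assumptions, not taken from the paper.
    """
    if finish_time <= deadline:
        return full_value              # deadline met: full value for every kind
    if kind is DeadlineKind.HARD:
        return float("-inf")           # a missed hard deadline is catastrophic
    if kind is DeadlineKind.FIRM:
        return 0.0                     # a late result is worthless but harmless
    # Soft: some benefit remains; assume it decays linearly with lateness.
    return max(0.0, full_value - decay * (finish_time - deadline))
```

Under this reading, a firm transaction that misses its deadline can simply be
dropped, whereas a late soft transaction may still be worth finishing.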



2 Related Work 

To process a transaction in a distributed environment, a common agreement is 
generally achieved by negotiations between a coordinator and the participants. 
DTP (Distributed Transaction Processing) [3] is a widely accepted model in
distributed transaction processing. It defines three kinds of roles
(Application Program, Transaction Manager and Resource Manager) and two kinds
of interfaces (the TX interface between Application Program and Transaction
Manager, and the XA interface between Transaction Manager and Resource
Manager). However, DTP does not support real-time transactions.

Real-time transaction schemes have been studied extensively in the database
area. Abbott [4] presented a new group of algorithms for scheduling real-time
transactions that produce serializable schedules. A model was proposed for
scheduling transactions with deadlines on a single-processor, disk-resident
database system. The scheduling algorithms have four components: a policy for
managing overloads, a policy for assigning priorities to tasks, a concurrency
control mechanism, and a policy for scheduling I/O requests. Some real-time
transaction scheduling algorithms were proposed in [5], which employ a hybrid
approach, i.e., a combination of pessimistic and optimistic approaches.
These protocols make use of a new conflict resolution scheme called dynamic
adjustment of serialization order, which supports priority-driven scheduling
and avoids unnecessary aborts.

This paper extends these previous results to the Grid service environment by 
providing the GridRTS with a set of interfaces for Grid application programmers. 



3 Real-Time Transaction Model 

The real-time transaction model we present here is based on the Globus Toolkit 
3. The core component GridRTS, as shown in Fig. 1, consists of the following: 

- Service Discovery. It discovers the required Grid services that can complete
  the sub-tasks for a real-time transaction.

- Deadline Calculation. It calculates the deadlines of (sub)transactions to
  coordinate these activities.

- Coordinator or Participant. It is dynamically created by the scheduler of
  GridRTS and lives until the end of a global transaction. The scheduler of
  GridRTS creates a coordinator or a participant when it receives a request to
  initiate a global transaction or perform a sub-transaction.

- Scheduler. It is in charge of scheduling the above modules.

- Interfaces. The OGSA interfaces are responsible for service management, such
  as creating a transient Grid service instance, while the TX interfaces are
  used to manage transactions.

Fig. 1. The real-time Grid transaction model



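[Figure 2: transaction tree. The root T (AND) has two children, T1 (OR) and
T2 (OR); T1 has leaves T11, T12, T13 and T2 has leaves T21, T22, T23.]

Fig. 2. A simple real-time Grid transaction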



Definition 1. A real-time transaction is a 6-tuple {T, D, S, R, DL, P},
where T is the set of sub-transactions, each sub-transaction T_i being
completed by several alternative functional services; D is the set of data
operated on by the real-time transaction; S is the state set; R is the set of
relationships between (sub)transactions, defined as R = {AND, OR, Before,
After}; DL is the set of deadlines; and P is the priority set.

The AND relationship means that all the sub-transactions of a global
transaction T must be completed before their deadlines d(T_i), i.e.,
T = T_1 AND T_2 AND ... AND T_m. Each sub-transaction is performed by n
alternative functional services. The OR relationship means that T_i is
completed if any T_ij finishes before the deadline of T_i, i.e.,
T_i = T_i1 OR T_i2 OR ... OR T_in (see Fig. 2). Before and After specify the
execution order between sub-transactions.
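
As a minimal sketch of the AND and OR relationships in Definition 1, the
following Python fragment models only T, DL, and the AND/OR part of R; the
data, state, priority, and Before/After components are omitted. The class and
field names are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubTransaction:
    name: str
    deadline: float
    # Finish time of each alternative service T_ij; None means it failed.
    finish_times: List[Optional[float]] = field(default_factory=list)

    def completed(self) -> bool:
        # OR: one alternative finishing before the deadline is enough.
        return any(t is not None and t < self.deadline
                   for t in self.finish_times)


@dataclass
class GlobalTransaction:
    subs: List[SubTransaction]

    def completed(self) -> bool:
        # AND: every sub-transaction must complete before its own deadline.
        return all(s.completed() for s in self.subs)
```

For example, a sub-transaction with finish times [12.0, 8.0, None] and
deadline 10.0 completes, because its second alternative finishes in time.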




Fig. 3. The execution flow of the real-time Grid transaction. 



4 Coordination of the Real-Time Grid Transaction 

4.1 The Process of the Real-Time Grid Transaction 

In the Grid service environment, a typical real-time transaction includes the 
following steps, as shown in Fig. 3. 

- Step 1: GridRTS initiates a global transaction on behalf of a real-time Grid
  application, and discovers and selects the required Grid services to serve
  as participants, using the service discovery module described in [6].

- Step 2: The scheduler creates a coordinator and broadcasts
  CoordinationContext (CC) messages to all selected remote participants, which
  are created locally and return Response messages to the coordinator.

- Step 3: The created coordinator and participants interact to control the
  transaction execution, including correct completion and failure recovery.
  The details are described in the following subsection.

4.2 Coordination Algorithms

As described above, a sub-transaction is completed by a set of alternative
functional services from an alternative functional service group (AFSG). The
members of the AFSG execute the same sub-task in parallel. If one member of
the AFSG completes successfully and reports a Committable message, the AFSG
is considered committable and the other members are aborted.

In the preparation phase, each alternative functional service executes a
specified sub-task in its private work area (PWA). On receipt of an Abort
message, the service rolls back the operations taken previously by releasing
the PWA. In the commit phase, the Commit message notifies the participants
that have reported Committable messages to the coordinator. These participants
are called committable participants and actually commit their
sub-transactions (see Fig. 4).



Algorithm of Coordinator
Input: service references of all alternative functional services S_i, and d(T)
Output: result of T or failure
{
  for all S_i in all AFSGs
    send Prepare messages to them;
  end for
  while (t < d(T)) {
    wait and record incoming messages;
    for each AFSG
      if (receive a Committable)
        send Abort to others in this AFSG;
    end for
    if (all AFSGs receive Committable)
      send Commit to committable participants;
    else
      send Rollback to them;
  }
}

(a) Coordinator algorithm

Algorithm of Participant
Input: d(T_i)
Output: result of T_i or failure
{
  while (t < d(T_i)) {
    wait and record incoming messages;
    if (receive a Prepare) {
      execute sub-task in PWA;
      if (successful)
        report Committable;
      else {
        report Uncommittable;
        rollback;
      }
    }
    Case (receive a message)
      Commit: {
        actually commit sub-transaction;
        send Committed;
      }
      Abort:
        rollback and send Aborted;
      Rollback:
        rollback and send Rollbacked;
    EndCase
  }
}

(b) Participant algorithm

Fig. 4. Coordination algorithms of the real-time Grid transaction
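
The coordinator's decision rule can be sketched in Python as follows. This is
a simplified, single-threaded replay over a list of already timestamped
messages, not the GridRTS implementation: the function name coordinate, the
message tuple layout, and the returned dictionary are assumptions made for
illustration, and real message passing, Prepare broadcasting, and failure
recovery are left out.

```python
def coordinate(afsgs, messages, deadline):
    """Decide Commit or Rollback from the messages that arrive before d(T).

    afsgs:    dict mapping an AFSG id to the list of its service ids
    messages: iterable of (arrival_time, service_id, status) tuples, where
              status is "Committable" or "Uncommittable"
    """
    owner = {s: g for g, members in afsgs.items() for s in members}
    winner = {}    # AFSG id -> first service that reported Committable
    aborted = []   # services told to abort because a sibling won
    for arrival, service, status in sorted(messages):
        if arrival >= deadline:
            break                      # messages after d(T) are ignored
        group = owner[service]
        if status == "Committable" and group not in winner:
            winner[group] = service
            aborted.extend(s for s in afsgs[group] if s != service)
    if len(winner) == len(afsgs):      # every AFSG has a committable member
        return {"decision": "Commit",
                "commit": sorted(winner.values()),
                "abort": sorted(aborted)}
    # At least one AFSG is not committable in time: roll everything back.
    return {"decision": "Rollback", "rollback": sorted(owner)}
```

With two AFSGs of two services each, the transaction commits only if one
member of each group reports Committable before the deadline; otherwise all
four services are rolled back.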



Fig. 5 illustrates the state transformation diagram of the real-time Grid
transaction. The solid rectangles indicate the states of both the coordinator
and participants. The dashed rectangle denotes the state of participants. The
transaction enters the Prepared state only when the coordinator receives a
Committable message from each AFSG before the deadline d(T). Otherwise, the
coordinator sends Rollback messages to undo the effects of the previous
operations.

Fig. 5. The state transformation diagram of the real-time Grid transaction

5 Algorithm Validation 

5.1 Modeling Algorithms with Petri Net 

A Petri net is an abstract and formal modelling tool. It models systems’ events, 
conditions and the relationships among them. The occurrence of these events 
may change the state of the system, causing some previous conditions to cease 
holding and other conditions to begin to hold [8]. In this work, we model the
coordination algorithms with a Petri net to verify their correctness. In the model, the
transitions indicate actions taken by participants, and the places represent the 
states of the coordinator and/or participants or the receipt of the coordination 
messages from the coordinator. 

Assume that a real-time Grid transaction consists of two sub-transactions
and each sub-transaction is completed by two Grid services. We use P_I1, P_I2
and P_II1, P_II2 to represent the first and second AFSGs, respectively.
Without loss of generality, we let P_I1 and P_II1 return Committable messages
first and finally commit, while P_I2 and P_II2 are aborted. The Petri net
model RTPNM of this real-time transaction is depicted in Fig. 6, where P_I1
and P_II1 are illustrated by S_1, and P_I2 and P_II2 by S_2. The weights of
the arcs indicate the number of tokens changed (i.e., added or removed)
whenever a firing happens.



5.2 Analysis of the RTPNM 

Let M = (M_1, M_2, ..., M_13) be a marking, where M_i is the number of tokens
in place S_i. The RTPNM has two initial markings:

- M_0s = (2,2,4,0,0,2,0,0,0,2,0,0,0), when P_I1 and P_II1 commit while P_I2
  and P_II2 are aborted, and



A Real-Time Transaction Approach for Grid Services 







Places:
S1: P_I1,II1-Active        S2: P_I2,II2-Active      S3: C-Prepare
S4: P_I1,II1-Preparing     S5: P_I1,II1-Prepared    S6: C-Commit
S7: P_I1,II1-Committing    S8: P-Ended, C-Ended     S9: P_I2,II2-Preparing
S10: C-Abort               S11: P_I2,II2-Aborting   S12: C-Rollback
S13: P_I1,II1,I2,II2-Rollbacking

Transitions:
T1: P_I1,II1 prepare for commit      T2: P_I1,II1 return Committable
T3: P_I1,II1 commit                  T4: P_I1,II1 return Committed
T5: P_I2,II2 prepare for commit      T6: P_I2,II2 abort
T7: P_I2,II2 return Aborted          T8: P_I1,I2,II1,II2 rollback
T9: P_I1,I2,II1,II2 return Rollbacked

Fig. 6. The Petri net model of the real-time Grid transaction (RTPNM)



- M_0f = (2,2,4,0,0,0,0,0,0,0,0,4,0), when at least one AFSG cannot prepare
  for commit, so that all four services are rolled back.

The Petri net model can be used to analyze behavioral properties that depend
on the initial marking, including reachability, boundedness, liveness,
coverability, reversibility, persistence and so on. For a bounded Petri net,
the coverability tree is called the reachability tree, and all of the above
problems can be solved with the reachability tree [7]. Peterson [8] also
pointed out that many questions about Petri nets can be reduced to the
reachability problem. In this paper, we focus on boundedness and reachability
using the reachability tree, which is not shown here because it is too large.
By analyzing the reachability tree of the RTPNM, we can draw the following
conclusions.

Theorem 1. RTPNM is bounded. 

Proof: A Petri net (N, M_0) is said to be k-bounded, or simply bounded, if
the number of tokens in each place does not exceed a finite number k for any
marking reachable from the initial marking, i.e., M(S_i) ≤ k for every place
S_i and every marking M ∈ R(M_0) [7], where M_0 is an initial marking and
R(M_0) is the set of all markings reachable from M_0. By inspecting the
reachability tree of the RTPNM, we have found that ω (which represents an
arbitrarily large value) does not occur anywhere, and that the number of
tokens in each place is no more than 4. Therefore, the RTPNM is bounded with
k = 4.

Theorem 2. RTPNM is L1-live.

Proof: A transition is L1-live if it can be fired at least once in some firing
sequence. A Petri net is L1-live if all its transitions are L1-live. For a
bounded Petri net, the reachability tree contains all possible markings. After
inspecting the reachability tree of the bounded RTPNM, we have found that
every marking is reachable and every transition T_i (1 ≤ i ≤ 9) can be fired
at least once from M_0s or M_0f. Therefore, the RTPNM is L1-live.

Theorem 2 indicates that the RTPNM is deadlock-free as long as the firing
starts with M_0s or M_0f. Therefore, the coordination algorithms are correct.
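
The kind of inspection used in the two proofs can be illustrated with a small
Python sketch that enumerates the reachable markings of a bounded net and
records which transitions fire. It uses a plain breadth-first search rather
than a coverability tree, so it only terminates normally for bounded nets, and
it is demonstrated on a hypothetical three-place toy net, not on the RTPNM,
whose arcs are not reproduced here.

```python
from collections import deque


def enabled(marking, pre):
    """A transition is enabled if every input place holds enough tokens."""
    return all(marking[p] >= w for p, w in pre.items())


def fire(marking, pre, post):
    """Return the marking obtained by firing a transition."""
    m = list(marking)
    for p, w in pre.items():
        m[p] -= w
    for p, w in post.items():
        m[p] += w
    return tuple(m)


def explore(initial, transitions, limit=10_000):
    """Breadth-first search over the markings reachable from `initial`.

    transitions maps a name to (pre, post), each a dict {place: arc weight}.
    Returns (reachable markings, names of transitions that fired).
    """
    seen, fired, queue = {initial}, set(), deque([initial])
    while queue:
        m = queue.popleft()
        for name, (pre, post) in transitions.items():
            if enabled(m, pre):
                fired.add(name)
                nxt = fire(m, pre, post)
                if nxt not in seen:
                    if len(seen) >= limit:
                        raise RuntimeError("limit hit; net may be unbounded")
                    seen.add(nxt)
                    queue.append(nxt)
    return seen, fired


# Hypothetical toy net: places 0 = Active, 1 = Prepared, 2 = Ended.
toy = {"prepare": ({0: 1}, {1: 1}), "commit": ({1: 1}, {2: 1})}
markings, fired = explore((2, 0, 0), toy)
bound = max(max(m) for m in markings)   # smallest k for which the net is k-bounded
l1_live = fired == set(toy)             # every transition fired at least once
```

For the toy net, the search visits six markings, no place ever holds more
than two tokens, and both transitions fire, so the net is 2-bounded and
L1-live from the chosen initial marking.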



6 Conclusions and Future Works 

We have presented a real-time Grid transaction model. Its core component 
GridRTS can intelligently discover required Grid services as participants to per- 
form specified sub-transactions, and coordinate multiple Grid services to achieve 
real-time and transactional properties. Moreover, using the Petri net tool, we
have validated the correctness of the coordination algorithms, whether a
real-time transaction succeeds, starting with M_0s, or fails, starting with
M_0f.

In our future work, we will add a security mechanism to enable the model to
adapt to the actual commercial environment.
Abstract. In a wireless Grid computing environment, end-to-end Quality of
Service (QoS) can be very complex, which highlights the increasing need for
management of QoS itself. Policy quorum-based management offers a more
flexible, customizable management solution that allows controlled elements to
be configured or scheduled on the fly for a specific requirement tailored to
a customer. This paper presents a QoS guaranteed management scheme, Policy
Quorum Resource Management (PQRM), which facilitates the reliable scheduling
of the wireless Grid system with a dynamic resource management architecture
aimed at fulfilling the QoS requirements. Experimental results show that the
proposed PQRM with a resource reconfiguration scheme improves both performance
and stability, which makes it suitable for wireless Grid services.



1 Introduction 

Grid computing provides widespread dynamic, flexible and coordinated sharing 
of geographically distributed heterogeneous networked resources among dynamic 
user groups. Wireless communications is a rapidly evolving and promising sector 
of the communications arena, and even in a challenging time for the telecommuni- 
cations industry, represents a significant development opportunity for companies 
and organizations in creating a global market. The increasing reliance on wire- 
less networks for information exchange makes it critical to maintain reliable and 
secure communications even in the instances of a component failure or security 
breach. In wireless networks, mobile application systems continuously join and 
leave the network and change locations with the resulting mobility impacting 
the degree of survivability, security, and reliability of the communication.
Our research focuses on policy-based dynamic resource allocation and
management for Quality of Service (QoS) guaranteed applications in a wireless
Grid [1], [2].

Our resource management policy needs to effectively map the pre-defined QoS
requirements to the actual resources on the wireless network. Ideally, some
of these QoS requirements will be defined at the local system level to
minimize the protocols used in the middleware system. However, we need to
address the problem of combining different kinds of QoS from the multiple
resources available in the Grid and among multiple Grids. We present a dynamic
resource management architecture aimed at fulfilling the above requirements in
wireless Grid services. We discuss the efficiency of the proposed
reconfiguration policy on an experimental Grid testbed. We expect that these
kinds of reliable, secure, and QoS guaranteed Grid services will be required
for future mobile health care applications.



2 The Management for Cardiovascular Simulator in
Wireless Grid

A general architecture for future wireless access networks involves a hybrid of
technologies. Advances in patient care monitoring have allowed physicians to
track an Intensive Care Unit patient’s physiological state more closely and with
greater accuracy. Modern computerized clinical information systems can take
information from multiple data sources and store it in a central location. The
Multi-parameter Intelligent Monitoring for Intensive Care (MIMIC) II database
[3] takes advantage of the data collection systems installed at a partner
hospital to collect large numbers of real patient records. The MIMIC II
Relational Database Management System (RDBMS) is used to administer this
database and create table definitions. Data such as nursing notes,
medications, fluid input and output, updates to patient charts, lab results,
and CareVue data can be downloaded from the hospital’s Infrastructure Support
Mart (ISM) and entered into the MIMIC II database. Waveforms such as
electrocardiograms (ECGs), blood pressure, and pulse oximetry are stored by a
separate waveform collection system in a separate database at the hospital,
which is the waveform counterpart to MIMIC II.

Fig. 1 represents a computational physiological scenario using a wireless Grid.
Cardiovascular (CV) data measured at home are transferred to a “patient
monitoring manager” that assesses those data. In this manager server, the
monitored data are compared with the pathological data set stored in the
MIMIC II database of PhysioNet [4]. If the signal is determined to be
indicative of a pathological state, the cardiovascular information is used to
determine the parameters of a CV model. CV hemodynamic variables under a
specific physiological hypothesis are simulated and compared with the abnormal
signal. In this simulation stage, Grid computing is utilized to speed up the
analysis.

Fig. 1. Data flow for a mobile patient monitoring application using a CV
simulation distributed on the wireless Grid interacting with a CV database



3 Quorum-Based Resource Management Model 

3.1 Definition of Quorum Model 

Most Grid applications specify the types of resources they require, such as
various types of processing, storage, and communication requirements among
such elements. To guarantee the QoS of a user’s application, the Grid manager
should assign resources to each application. The Quorum-based management
system defines a QoS vector for each application from its input profile.
We define two items: the system QoS vector and the network QoS vector. Each
service requester must specify its QoS requirements to the Quorum Manager
(QM) in terms of the minimum QoS properties for both the node system and the
network resources. We define the resource Quorum Q_R, which represents the
current status of the resources. Each resource has a resource status vector
that consists of both invariable and variable descriptions [5]. A system
resource can take the processor specification or memory size as invariable
descriptions and processor utilization or available memory space as variable
descriptions; a network resource can take end-to-end available bandwidth,
delay, and data loss rate as variable descriptions.

$$Q_R = \{\theta_i,\ \theta_{jk}\} \qquad (1)$$

where θ_i denotes the current available resource level of system resource i
and θ_jk represents the current available resource level of the network
between system resources j and k.

We assume that the system has to admit and allocate resources for the set
A = {a^1, ..., a^k, ..., a^m} of applications. An application is also
represented by an undirected graph of tasks and their communication relations.
A required QoS level is a vector over the resource description and ranges
between a minimum and a maximum requirement, which we denote by q and Q,
respectively. We define the QoS-Quorum Q_A, which represents the required
quality level for an application set A.

$$Q_A = \{(q_i^k, Q_i^k),\ (q_{ij}^{kl}, Q_{ij}^{kl})\} \qquad (2)$$

where q_i^k and Q_i^k denote the minimum and maximum QoS level required for
task k on system resource i, respectively, and q_ij^kl and Q_ij^kl represent
the minimum and maximum QoS level required for communication between task k
on system resource i and task l on system resource j, respectively.

A resource universe R = {r_1, ..., r_n} is the collection of resources
available in the administration domain. A resource r = <S, N> can be
represented as an undirected graph of systems S and their communication
networks N. To achieve reliability in resource management, we define the
available resource Quorum Q_AR, which is the selected set of resources that
satisfy the QoS requirements from the SLAs.

$$Q_{AR} = \{(\theta_i, \theta_{ij}) \subset Q_R \mid q_i^k \le \theta_i \le Q_i^k,\ q_{ij}^{kl} \le \theta_{ij} \le Q_{ij}^{kl}\} \qquad (3)$$

where θ_i and θ_ij are the current available system and network resource
levels, q and Q are the minimum and maximum required QoS levels,
i, j = 1, ..., n' ≤ n and k, l = 1, ..., m. Q_AR is a set that satisfies the
desired minimum QoS level of the application. An available resource Quorum
set is obtained from the policy module in the resource allocation server.
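
As a rough sketch of how an available resource Quorum could be selected, the
Python function below keeps the systems and links whose current levels meet a
minimum requirement. It is a simplification under stated assumptions: each
level is reduced to a single scalar, only the minimum QoS level is checked,
and the function name and argument layout are hypothetical rather than taken
from the PQRM policy module.

```python
def available_quorum(system_levels, network_levels, min_system, min_network):
    """Select the resources that meet the minimum required QoS levels.

    system_levels:  dict {system id: current available level}
    network_levels: dict {(system id, system id): current available level}
    min_system, min_network: minimum levels required by the application
    """
    # Systems whose available level satisfies the minimum requirement.
    systems = {s for s, level in system_levels.items() if level >= min_system}
    # Links that satisfy the minimum and connect two qualifying systems.
    links = {
        (i, j)
        for (i, j), level in network_levels.items()
        if i in systems and j in systems and level >= min_network
    }
    return systems, links
```

A usage example: with system levels {"S1": 0.7, "S2": 0.2, "S3": 0.9} and a
minimum of 0.5, only S1 and S3 qualify, and only links between those two
systems are considered further.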



3.2 Resource Configuration for Resource Scheduling 

The resource configuration for scheduling, F(A, Q_AR), is the mapping
function

$$F(A, Q_{AR}) = \{(V^k, E^{kl}) \to (\theta_i, \theta_{ij})\} \qquad (4)$$

where i, j = 1, ..., n' ≤ n and k, l = 1, ..., m.

The resource topology is identified from the status S(θ_s) or S(θ_n), which
reflects the resource quality. We intend to estimate the current resource
utilization of each
application activity. Assuming that an application reports its previous
execution progress as events, we can predict the current resource allocation
sensitivity for the target application. In the network performance vector,
end-to-end resource performance, such as bandwidth, delay, or data loss rate,
is a peer-to-peer value. If a Grid resource management system had to measure
all the peer systems, it would face a scalability problem and be unrealistic.
Thus, in the initial step of resource allocation, the resource allocation and
policy server usually generates the available resource Quorum in the network
established by the resource brokers and each peer.
4 Policy Quorum Based Resource Management in
Wireless Grids

In order to address the resource allocation problem, we propose a Policy
Quorum-based Resource Management (PQRM) system that provides a reliable
resource scheduling scheme in the wireless Grid. The PQRM shown in Fig. 2 has
a layered architecture involving a session admission controller, with an
application profile and the requested applications as inputs [6], [7].

The Session Admission Controller (SAC) informs the Quorum Manager (QM). The
QM inspects available resources that satisfy the minimum QoS requirements
through MDS (Monitoring and Discovery Service). The Monitoring Manager (MM)
obtains information about Grid resources from the agent that queries the
resources. Handling the SLA is an important task for the SAC, which interacts
with users and their requirements such as time or cost deadlines.







Fig. 2. Operational architecture of the PQRM system 



We can obtain the initial resource configuration set as a mapping from
sub-jobs to the available resource Quorum set. The initial resource
configuration satisfies the QoS requirement because it is a subset of Q_AR,
the set established from the user’s SLA. A system failure or the admission of
another application to a resource influences the variable QoS vector. The
temporal instants T_i are the event generation points of each application. In
general, instrumentation is a useful technique for measuring application
performance because it reveals the application’s behavior.

5 Experimental Results and Discussions 

Fig. 3 represents a computational physiological scenario using Grid computing. 
In this figure, the human ECG is monitored and transmitted to the PhysioNet 









C.-H. Youn et al. 



server. The monitored data is then delivered to a simulator via a mobile com- 
puter, where the system compares it with the pathological data set stored in 
MIMIC [3]. In case of an abnormal signal, identified through a nodal fit to data 
in MIMIC, the algorithm may provide pertinent advice to the monitored sub- 
ject. The information can then be delivered to a medical doctor. Furthermore, 
personally optimized medication information can also be delivered to the patient 
client. 




Fig. 3. Data flow for a patient monitoring application using an ECG computa- 
tion distributed on the Grid interacting with an ECG database (MIMIC DB) 
via PhysioNet (a traditional client-server model) 



In a wireless Grid infrastructure, all computation needed to simulate the
monitored ECG signal will be executed on the available resources provided by a
PQRM system. For real-time processing, it is necessary to identify the
resources available in the Grid infrastructure. In particular, in order to
ascertain the reliability of the wireless environment, a policy-based Quorum
is considered in the context of resource management. If a user wishes to
access his/her daily biometrics from a Grid server, it is necessary for some
computational resources to remain online regardless of its location with
respect to the VOs. The systems used in this experiment are eight in total,
which consist of six systems in Korea and a PhysioNet server at MIT in the US.
We assume that user jobs J1, J2, J3, J4, J5, and J6 start at time instants T1,
T2, T3, T4, T5, and T6, respectively. Each job is executed independently at
each node. Each job is composed of 4 sub-jobs. Each sub-job calculates the
entropy of the ECG signal that is transmitted from users. To evaluate
mobility-based resource management, we considered two types of experiments:
one with no resource reconfiguration after the initial configuration, and one
with resource reconfiguration. In the no-reconfiguration approach, the jobs
were executed with the resource topology determined at S1. The initial Quorum
set is {S1, S2, S3, S4, S5, S6}. The non-reconfiguration method executes jobs
sequentially on system S1, whereas the resource reconfiguration method is
performed at each temporal instant Ti, e.g., Quorum S1 = {(J1, T1), (J2, T2),
(J3, T3), (J4, T4), (J5, T5), (J6, T6)}. User mobility, on the other hand,
invokes a policy-Quorum management module in PQRM to reconfigure the resources
so that the initial SLAs remain guaranteed. A reconfiguration algorithm may
change the current Quorum set by reshuffling the resource topology. Fig. 5
shows that the resource reconfiguration was triggered at T4. After the
resource reconfiguration is triggered, the newly generated resource Quorum and
resource re-allocation are determined as S1 = {J1, J2, J3, J6} and
S8 = {J4, J5} at T4. S8 is added to the new resource Quorum, and J4 and J5 are
moved from system S1 to S8. Fig. 4 represents the resource status after
reconfiguration. Fig. 5 depicts the performance comparison of the average
execution time with and without a Resource Reconfiguration (RR) policy. The
Quorum management policy with resource reconfiguration (in the bottom two
bars) shows a more stable execution time than the scheme with no
reconfiguration management.
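
The bookkeeping behind the reconfiguration at T4 can be replayed with a short
Python sketch. It only moves a given list of jobs and extends the Quorum set;
how PQRM decides which jobs to move and which system to add is not specified
in the text, so those are plain inputs here, and the function name reconfigure
is hypothetical.

```python
def reconfigure(quorum, allocation, jobs, source, target):
    """Move `jobs` from `source` to `target` and add `target` to the Quorum.

    Returns a new (quorum, allocation) pair and leaves the inputs unchanged.
    """
    missing = [j for j in jobs if j not in allocation.get(source, [])]
    if missing:
        raise ValueError(f"jobs not allocated on {source}: {missing}")
    new_quorum = set(quorum) | {target}
    new_allocation = {s: list(js) for s, js in allocation.items()}
    new_allocation[source] = [j for j in new_allocation[source] if j not in jobs]
    new_allocation.setdefault(target, []).extend(jobs)
    return new_quorum, new_allocation


# Replaying the step described above: J4 and J5 move from S1 to S8 at T4.
initial_quorum = {"S1", "S2", "S3", "S4", "S5", "S6"}
initial_allocation = {"S1": ["J1", "J2", "J3", "J4", "J5", "J6"]}
new_quorum, new_allocation = reconfigure(
    initial_quorum, initial_allocation, ["J4", "J5"], "S1", "S8"
)
```

After the call, S1 holds J1, J2, J3, and J6, S8 holds J4 and J5, and S8
belongs to the new Quorum set, which matches the allocation reported at T4.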




Fig. 4. Comparison of resource status (available CPU ratio) in SI and S8 after 
reconfiguration at T4 (NR: No reconfiguration policy, RR: Resource Reconfigu- 
ration policy) 



6 Conclusions 

End-to-end QoS can be very complex in the wireless Grid computing
environment. Policy-based management offers a more flexible, customizable
management solution that allows controlled elements to be configured or
scheduled on the fly for a specific requirement tailored to a customer. The
proposed Quorum-based resource management policy is suitable for effectively
mapping the pre-defined QoS requirements to the actual resources on the
wireless network. This paper has presented a QoS guaranteed management scheme
that facilitates the reliable scheduling of the wireless Grid system with a
dynamic resource management architecture aimed at fulfilling the QoS
requirements. We have discussed the performance and stability of the PQRM
with a resource reconfiguration scheme. An ECG analysis for mobile medical
care has been presented, which would benefit from such a scheme.

[Figure 5: bar chart of average execution time at T1 to T6 and the total
average time; series: average time per job (NR) and average time per job (RR).]

Fig. 5. Performance comparison of execution time with and without policy of
resource reconfiguration (NR: No reconfiguration policy, RR: Resource
Reconfiguration policy)
Abstract. The success of grid computing depends on the existence of
grid middleware that provides core services such as security, data
management, resource information, and resource brokering and scheduling.
Current general-purpose grid resource brokers deal only with the computation
requirements of applications, which is a limitation for data grids that
enable processing of large scientific data sets. In this paper, a new
data-aware resource brokering scheme, which factors both computational and
data transfer requirements into its cost models, has been implemented
and tested. The experiments reported in this paper clearly demonstrate
that both factors should be considered in order to efficiently schedule
data-intensive tasks.



1 Introduction 

The term grid computing is typically used to refer to a networked virtual comput- 
ing environment which extends across geographical and organisational bound- 
aries [1, 2]. It allows a group of individuals and/or institutions, a “virtual 
organisation”, to share resources. Resources that can be shared include computing 
cycles, data storage, software, and special scientific equipment. The heterogeneous 
and dynamic nature of the grid environment makes the task of developing 
grid applications extremely challenging. Thus, the success of grid computing 
will depend on the existence of high-level frameworks or middleware to abstract 
over the complexity of the underlying environment, and facilitate the design, 
development and efficient execution of grid applications. 

Broadly speaking, such a framework will have components for: high-level 
description of the stages of processing needed to produce desired results from 
available data; decomposition of that processing into tasks, or jobs, that can 
be deployed on grid resources; a resource broker mechanism for matching the 
requirements of these jobs with the resources available and scheduling their ex- 
ecution; and mechanisms for deploying jobs onto chosen resources. 

In this paper, we investigate resource management and scheduling issues 
associated with the resource broker component, which is responsible for matching 
resource requirements of jobs with actual available and suitable grid resources. 

Clearly, the resource broker must have ready access to up-to-date information 
about the grid in order to initiate a process of resource discovery that first 
finds a set of candidate resources, and it needs a mechanism for determining the 
most suitable of these candidates. A resource broker takes into account relevant properties 
of a given resource, typically computational properties such as processor speed, 
amount of memory, and processor architecture. Other properties that may be 
considered by a resource broker for data grids are communication bandwidth 
and data storage. Information is provided by information services such as the 
Metadata Directory Service (MDS) and Replica Catalog of the Globus Toolkit [3] 
and the Network Weather Service (NWS) [4]. 

We refer to a resource broker that primarily uses computational properties in its 
decision making as compute-centric, and one that principally uses data-oriented 
properties as data-centric. A compute-centric strategy is appropriate when jobs 
are primarily computationally intensive. A data-centric strategy may be ap- 
propriate when using a massive data set, with better performance achieved by 
moving code to data. 

Our interest here is in general data grids, in which both computation and 
data are important. Many interesting applications involve geographically dis- 
persed analysis of large repositories of measured or computed data. For exam- 
ple, the Belle experiment [5], at the KEK particle accelerator in Japan, is a 
large world-wide collaboration. It generates terabytes of experimental and simu- 
lation data, with significant data transfer costs. Computation is also important, 
as researchers conduct simulations and analysis. Here, the broker should make 
better choices by using a strategy accounting for costs of both data transfer and 
computation. Studies by Ranganathan and Foster [6] support this view. 

However, current general-purpose grid resource brokers are compute-centric. 
Batch queuing systems such as PBS, LSF, Sun Grid Engine and Condor were 
originally developed to manage job scheduling on parallel computers or local- 
area networks of workstations, and therefore ignored the cost of data movement. 
These systems have been extended to manage job scheduling in wide-area com- 
putational grids, but still do not consider data transfer costs in their scheduling 
models. The Nimrod project [7] offers a novel approach to scheduling, based on 
an economic model that assigns costs to resource usage [8], allowing users to 
trade off execution time against resource costs, but data transfer is not 
considered by the scheduler. 

The AppLeS (Application Level Scheduling) project [9] approaches the sched- 
uling problem at the application level, dynamically generating a schedule based 
on application-specific performance criteria. Data properties can be used, but 
scheduling logic is embedded within the application itself, an approach not easily 
re-targeted for different applications or execution environments [10]. Our work 
aims to create a generic resource broker for any grid application. 

After the work reported in this paper had been completed, similar work 
on data-aware resource brokers was published. The approach of the Gridbus 
Broker [11] is that when a job is being scheduled, if there is a choice of compute 
nodes of similar capability, the scheduler will choose the one with the highest 
bandwidth connection to a copy of the input data. Version 1.1 of the JOSH 
system [12] for Sun Grid Engine also takes into account data transfer in its 
scheduling decisions. 

The primary objective of this investigation is to show that application 
performance can be improved by making scheduling decisions that take into account 
not just the computational performance of resources, but also data transfer fac- 
tors. To this end, we discuss the design, implementation, measurement and eval- 
uation of a new, data-aware resource broker. 

2 The Data-Aware Resource Broker 

The Data-Aware Resource Broker (DARB) is a generic resource broker, designed 
specifically with the needs of data-intensive applications in mind. DARB under- 
takes resource discovery, that is, finding resources in the grid environment that 
match the needs of the application; and resource selection, that is, choosing 
between alternative data, software and hardware resources. 

For this investigation, DARB was implemented in Java within the Gridbus 
framework [13]. Gridbus is an extension of Nimrod/G [7], which supports execu- 
tion of parameter sweep experiments over global grids. Such studies are ideal for 
the grid, since they consist of a large number of independent jobs operating on 
different input data. Nimrod/G and Gridbus provide a simple declarative para- 
metric modelling language and GUI tools for specifying simulation parameters. 
A job is generated for each member of the cross product of these parameters. The 
system manages the distribution of the various jobs to resources and organises 
the aggregation of results. 
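
As a rough illustration of how a parameter sweep expands into jobs, the sketch below builds one job per member of the cross product of the parameter values. The parameter names (`input_file`, `compute_ratio`), their values, and the dictionary layout of a job are hypothetical; they are not taken from the Nimrod/G or Gridbus modelling language.

```python
from itertools import product

# Hypothetical sweep parameters; names and values are illustrative only.
parameters = {
    "input_file": ["data0.dat", "data1.dat", "data2.dat"],
    "compute_ratio": [5, 10],
}


def generate_jobs(params):
    """Return one job (a dict of parameter bindings) per member of the cross product."""
    names = list(params)
    value_lists = [params[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


jobs = generate_jobs(parameters)
```

With three input files and two ratio values, the sweep yields six independent jobs, each of which could be scheduled separately.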

2.1 Architecture and Implementation 

Our experimental framework comprises three key components, shown in Figure 1. 

1. The Job Submission Component provides a mechanism for users to submit 
applications for execution. 

2. The Scheduling Framework is the resource broker component, responsible 
for allocating the user-submitted jobs to available resources in an efficient 
manner. 

3. The Job Dispatching Component interfaces with the underlying grid 
middleware, dispatching and monitoring jobs. 

Here we focus on the scheduling framework, which we have implemented us- 
ing DARB. For these experiments, we use the job submission and dispatching 
modules provided by Gridbus. DARB interfaces with the Gridbus system; 
however, it is important to note that the DARB resource broker is generic, and can 
be “plugged in” to other grid environments. 

DARB consists of three modules: 

1. The Grid Information Module regularly queries the MDS and NWS for state 
information of the grid nodes and network links. This information is used 
to build a model of the current grid environment, representing the grid as a 
graph structure in which grid nodes are represented by vertices and 
communication links by edges. 

2. The Replica Catalog Module provides a convenient interface for accessing 
information on file locations using the Globus Replica Catalog. It is used to 
support data-aware scheduling. 

3. The Scheduler Module uses information from the Grid Information and 
Replica Catalog modules to map jobs (generated by the Initializer Module) 
to appropriate grid nodes (using the Dispatcher Module). 

Fig. 1. System Architecture 

To form a complete system, these modules are used in conjunction with 
Initializer, Dispatcher and Job Monitor modules provided by the Gridbus system. 
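
The graph model built by the Grid Information Module could be represented along the following lines. This is only a sketch: the class names, field names, and the `best_source` helper are assumptions made for illustration, not DARB's actual Java classes. The two bandwidth values in the usage lines are the approximate inter-city figures reported for the testbed later in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A grid node (graph vertex) with the state a broker might track."""
    name: str
    cpus: int
    clock_ghz: float
    available_cpu: float  # fraction of CPU currently free, 0.0 to 1.0


@dataclass
class GridModel:
    """Grid nodes as vertices, communication links as directed edges."""
    nodes: dict = field(default_factory=dict)      # name -> Node
    bandwidth: dict = field(default_factory=dict)  # (src, dst) -> Mbit/s forecast

    def add_node(self, node):
        self.nodes[node.name] = node

    def set_bandwidth(self, src, dst, mbits_per_sec):
        # Links are stored per direction, since forecasts may be asymmetric.
        self.bandwidth[(src, dst)] = mbits_per_sec

    def best_source(self, replicas, dst):
        """Pick the replica holder with the highest forecast bandwidth to dst."""
        candidates = [r for r in replicas if (r, dst) in self.bandwidth]
        return max(candidates, key=lambda r: self.bandwidth[(r, dst)], default=None)


grid = GridModel()
grid.add_node(Node("canberra", cpus=2, clock_ghz=2.8, available_cpu=1.0))
grid.set_bandwidth("adelaide", "canberra", 2.0)
grid.set_bandwidth("sydney", "canberra", 15.0)
```

Storing edges per direction is a design choice of this sketch; a symmetric model would need only one entry per pair of nodes.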

2.2 The Scheduler Module 

The data grid environment which the DARB targets consists of distributed sites 
that have both data stores and computational capabilities. Input data sets for 
applications are replicated among these sites. The Globus Replica Catalog pro- 
vides a mapping between the logical name of a data set and its physical locations. 
Each site in the environment may have different computational capabilities and 
data stores, which will change over time. In such an environment, it is vital for 
a resource broker to be provided with information about these characteristics, 
so that it can factor them into an appropriate performance model. 

In the Scheduler module, MDS and NWS information from the Grid Infor- 
mation module is used to determine the current state of the grid environment. 
The Replica Catalog is used to determine the locations of a data file requested 
by an application. In order to arrive at a schedule for an application, DARB 
takes into account the machine architecture, clock speed, and number of CPUs; 
the CPU load; the network bandwidth; the size of the input data (obtained from 
the Replica Catalog); and the amount of computation involved. 

The total time to execute a job on a particular compute node comprises 
two components: data transfer time and computation time. If the compute node 
already has the input file, data transfer time will be zero. Otherwise, the time to 
transfer the input file is calculated using the bandwidth forecast obtained from 
the NWS. Experiments have shown that latency is negligible for the transfer of 
large data files, and as such, the latency forecast from NWS is not factored into 
the calculation. We estimate the expected transfer time from the input data size 
and the bandwidth information (from the Grid Information Module). 

A correction is applied in the calculation of the transfer time, since our 
experiments showed that the NWS tends to return bandwidth forecasts that are 
less than the actual value. As noted by Primet et al. [14], NWS transfers small 
amounts of data (default 64KB) which leads to very short transfer time on fast 
networks. Thus, any error in the time determination, due to the TCP slow start 
mechanism or other factors, will strongly influence measured bandwidth values. 
Primet et al. suggested the complementary use of additional tools to conduct 
less frequent probe experiments using larger amounts of data. An alternative 
approach is taken in DARB, which relies on measurements of data transfer times 
from the execution of previously submitted jobs. 
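
The paper does not give the exact form of this correction. One plausible scheme, shown here purely as an assumption, scales each NWS forecast by the mean ratio of observed to forecast bandwidth recorded for earlier job transfers on the same link. The class and method names are hypothetical.

```python
class BandwidthCorrector:
    """Scale NWS forecasts by the mean ratio of observed to forecast bandwidth."""

    def __init__(self):
        self.ratios = {}  # (src, dst) -> list of observed/forecast ratios

    def record_transfer(self, link, size_mbit, seconds, forecast_mbps):
        """Store how far the forecast was off for one completed transfer."""
        observed_mbps = size_mbit / seconds
        self.ratios.setdefault(link, []).append(observed_mbps / forecast_mbps)

    def corrected(self, link, forecast_mbps):
        """Return the forecast adjusted by past measurements on this link."""
        history = self.ratios.get(link)
        if not history:
            return forecast_mbps  # no measurements yet: use the raw forecast
        return forecast_mbps * sum(history) / len(history)
```

Because NWS is reported to underestimate bandwidth, the stored ratios would typically be greater than one, so the corrected forecast is raised accordingly.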

The expected computation time is taken from a user-supplied estimate of 
job execution time on a given architecture, adjusted by the current load on a 
candidate node, and the number of CPUs at that node. In parameter sweep appli- 
cations, the user estimate could be refined based on measured times for previous 
executions of the same program, however this has not yet been implemented. 
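
A minimal sketch of the resulting cost model is given below. The transfer term follows the text directly (zero for a local file, otherwise size divided by bandwidth). The way load and CPU count adjust the user estimate is not specified in the paper, so the `compute_time` formula here is an assumed model, and all function and parameter names are hypothetical.

```python
def transfer_time(size_mb, bandwidth_mbps, is_local):
    """Seconds to fetch the input; zero when the node already holds the file."""
    if is_local:
        return 0.0
    return size_mb * 8.0 / bandwidth_mbps  # megabytes -> megabits


def compute_time(user_estimate_s, available_cpu, cpus):
    """User estimate stretched by the share of one CPU the job can expect.

    Assumed model: a job uses at most one CPU, and the free capacity of the
    node is available_cpu * cpus CPUs.
    """
    share = min(1.0, available_cpu * cpus)
    if share <= 0.0:
        return float("inf")  # node is fully loaded
    return user_estimate_s / share


def total_time(size_mb, bandwidth_mbps, is_local, user_estimate_s, available_cpu, cpus):
    """Expected execution time = data transfer time + computation time."""
    return transfer_time(size_mb, bandwidth_mbps, is_local) + compute_time(
        user_estimate_s, available_cpu, cpus
    )
```

For instance, moving an 80 MB file over a 2 Mbit/s link costs 80 × 8 / 2 = 320 seconds under this model, which can easily exceed the computation time of a short job.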

2.3 The Scheduling Algorithm 

Parameter sweep applications involve N independent jobs (each with the same 
task specification, but a different input file) on M distributed nodes, where N 
is typically much larger than M. In order to efficiently map these jobs to the 
nodes, it is important to consider both availability of computational resources 
and location of required data. A balance needs to be reached between minimizing 
computation and data transfer times. In general, if computational resources are 
available, it is preferable to execute the jobs where data already exists in order 
to minimize network traffic. However, when nodes containing data are busy and 
there are idle nodes elsewhere, the cost of data transfer may be outweighed by 
the increase in job throughput if the jobs are rescheduled to the free nodes. 

Initially, the Data-Aware Resource Broker attempts to allocate the jobs to 
the nodes that already contain the input data. The available CPU value of each 
node is monitored to ensure that no node is overloaded with jobs. This process 
continues until it is no longer optimal in terms of overall throughput for the nodes 
containing the data to execute the jobs. This occurs when there are idle nodes 
and the expected execution time on one of those nodes is less than the expected 
execution time on the node containing the data. If this is the case, the resource 
broker retrieves the list of nodes that are available and calculates the expected 
completion time for the particular job on each node. The expected execution 
time of a job on a particular node is calculated via the following steps: 

1. retrieve the list of nodes containing the input file required by the job; 

2. select, from the above list, the best node in terms of bandwidth to the node 
that the job is to be executed on; 

3. calculate the expected execution time by adding the expected computation and 
transfer times. 
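
The three steps above, followed by the choice of the node with the shortest expected time, might be sketched as follows. This is an illustration under stated assumptions, not the DARB implementation: `compute_estimate` is assumed to hold a load-adjusted computation time in seconds for each node, `bandwidth` holds forecasts in Mbit/s, and all names are hypothetical.

```python
def expected_time(job, node, replicas, bandwidth, compute_estimate):
    """Steps 1 to 3: find replica holders, pick the best link, add the two times."""
    holders = replicas[job["input"]]  # step 1: nodes holding the input file
    if node in holders:
        transfer = 0.0  # data is already local
    else:
        best_bw = max(bandwidth[(h, node)] for h in holders)  # step 2
        transfer = job["size_mb"] * 8.0 / best_bw
    return compute_estimate[node] + transfer  # step 3


def choose_node(job, idle_nodes, data_node, replicas, bandwidth, compute_estimate):
    """Keep the job on its data node unless an idle node is expected to finish sooner."""
    best = data_node
    best_t = expected_time(job, data_node, replicas, bandwidth, compute_estimate)
    for node in idle_nodes:
        t = expected_time(job, node, replicas, bandwidth, compute_estimate)
        if t < best_t:
            best, best_t = node, t
    return best
```

In this sketch a busy data node is modelled simply by a large entry in `compute_estimate`, so the comparison captures the trade-off between waiting for the data node and paying the transfer cost.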

The job is rescheduled to the node that gives the shortest expected execution 
time. 

3 Experimental Evaluation 

This section presents the results of some initial experiments to evaluate the 
performance of the Data-Aware Resource Broker, and compare it with the standard 
compute-centric and data-centric resource broker strategies. The experiments are 
based on the requirements of the Belle data grid, but these requirements would be 
common to many scientific applications. 

It is assumed that there are a large number of data files distributed across 
the nodes of the grid, and a researcher wants to run a data processing program 
on each of a specified subset of these data files. The researcher may even want 
to run many instances of the program for each data file, using different input 
parameters. It is assumed that all required jobs can be run independently, and 
the researcher’s goal is to minimize the total turnaround time in executing all of 
these jobs. The resource broker should therefore schedule the jobs to the available 
nodes so that the job throughput is maximized. Hence, performance is measured 
in terms of total elapsed (wall clock) time to execute a specified set of jobs. 

In some cases the data files to be processed may be distributed fairly regularly 
across nodes in the grid. For example, in the Belle data grid these could be the 
output of Monte Carlo simulations of the Belle experiment, which are generated 
on each of the nodes and might be stored on the node on which they were 
generated. In other situations the distribution of the data among grid nodes 
may be highly irregular. For example, data from a physics experiment such as 
Belle may be stored on a single server at the experiment’s location, with replicas 
of the data located at a relatively small number of other nodes. 

The experiments were performed on the Australian Belle Data Grid testbed, 
composed of five compute server nodes located at the University of Melbourne, 
the University of Sydney, the University of Adelaide, and the Australian 
National University in Canberra. Two machines were located in Melbourne; the 
rest were hundreds of kilometers apart. All of the servers used Intel processors 
running Linux, with a single 2.8 GHz Xeon processor at Adelaide and Sydney, 
dual 2.8 GHz Xeons at Melbourne and Canberra, and a single 2 GHz Pentium 4 
processor at Melbourne. The average network bandwidth between the different 
machines varied considerably, from about 2 Mbits/sec (Adelaide to Canberra) to 
15 Mbits/sec (Sydney to Canberra) for machines in different cities, and around 
60 Mbits/sec between the two Melbourne nodes. We implemented a single clique 
containing all the nodes of our testbed for the purpose of running NWS sensors. 

A simulated application, called DataSim, was developed for initial testing of 
DARB. It is a simple program which reads from an input file, performs some 
computation, and writes to an output file. The program does not require all the 
input to be available when it starts. Computation can proceed as soon as there 
is a specified amount of input to be processed, i.e. it can support data streaming. 
Several factors are customisable via its arguments, including the compute/input 
ratio (how much computation is performed per input element); the output/input 
ratio (how much output is written per input element); and whether input and 
output are streamed. 

Fig. 2. Total execution times for all 40 runs of the DataSim application with 
80 MB input files using compute-centric, data-centric or data-aware resource 
brokers. The first column gives the total elapsed wall clock time, while the second 
column is the aggregated time for computation and communication summed 
over all nodes. Results for even distribution of files are shown in the first row 
(compute/input ratio of 5) and second row (compute/input ratio of 10). The 
third row gives results for uneven distribution with compute/input ratio of 10. 



Each experiment consisted of 40 runs of the test application for a particular 
set of parameters, with four runs for each of 10 different input files that were 
distributed across all five nodes in the grid testbed. This was repeated 10 times 
for each experimental scenario, to get an average result. The following parameters 
were varied in the different scenarios: 

- Size of input files: 40 MB and 80 MB input files were used in the tests. 
Data file size obviously varies greatly between different applications; however, 
varying the value gives an indication of sensitivity to input file sizes. 

- Distribution of input files: Both even and uneven distributions of input 
files were tested. For the uneven distribution, the Adelaide and Canberra 
nodes contained 60% of the input files, while the other 40% 
were distributed among the three nodes in Sydney and Melbourne. 

- Computation/Input ratio: This controls how many times a certain 
computation is performed for each element of input. Values of 5 and 10 were used 
in the experiments. The difference between the DARB and compute-centric 
algorithms will not be apparent for higher values of the computation/input ratio, 
since data transfer time will be insignificant. If the ratio is set too low, 
communication time will dominate and results for DARB will approach those of 
the data-centric algorithm. 

In calculating the network transfer times, it was assumed that the output 
files do not have to be transferred back to the client or a specific data storage 
node. These possibilities will be explored in future work. 

Figure 2 shows results for the scenarios listed above, for the 80 MB input 
data files. The results for the 40 MB input files are very similar, although as 
would be expected, the results for the data-centric approach are worse while 
the results for the compute-centric approach are closer to those produced by 
DARB. More detailed results for all of the experiments are available in Le’s 
thesis [15]. The most important measure is the elapsed wall clock time for the 
completion of all the jobs, which is shown in the first column of Figure 2. The 
second column shows the aggregated sum of compute times, data transfer times, 
and total times for job execution on each node, which gives an indication of the 
tradeoffs in data transfer and compute times for each algorithm. By considering 
both computational capability and cost of data transfers, the DARB approach 
gave the lowest total elapsed time in all the experiments. 

As expected, the data-centric approach did better when the input file sizes 
were larger and/or the computation/input ratio was lower. By always scheduling 
jobs to nodes which already contain the data, the data-centric algorithm left 
many of the more computationally powerful nodes idle while the slowest node was 
completing its allocation of jobs. This led to significantly poorer job throughput. 

The compute-centric approach did better when the input file sizes were 
smaller and/or the computation/input ratio was higher. Even in the cases when 
the compute-centric approach did almost as well as DARB for total execution 
time, the total data transfer time for the compute-centric algorithm was several 
times higher than that for the DARB algorithm. In situations where network 
costs are based on traffic volume, this could add significantly to the monetary 
cost of the computations. This can also lead to network congestion, and it was 
noted in some experiments that the large amount of network traffic caused certain 
connections to become bottlenecks, leading to significant variation in 
performance in the compute-centric algorithm. 

In the case where the data is irregularly distributed, the data-centric 
approach does relatively worse. This is a problem since the data-centric model 
is perhaps most likely to be used in situations with a relatively small number of 
servers having copies of very large data sets with large data transfer costs. 

4 Conclusion 

We have described the design, implementation and evaluation of DARB, a data- 
aware resource broker designed to efficiently handle the allocation of jobs to 
available resources in a data grid environment. Information from NWS and the 
Globus MDS and Replica Catalog is used to predict likely computation and 
data transfer times for each job, which are then used to generate a schedule for 
the jobs that tries to maximize throughput. 

The initial experiments reported here showed that DARB outperformed al- 
gorithms based only on computational requirements or data location, and hence 
it is necessary to take both into account for efficient job execution on a data 
grid. DARB also resulted in significantly reduced network traffic compared to a 
standard compute-centric approach. 

The Gridbus Broker [11] has an advantage over DARB, since it does not require 
accurate estimates of the compute time or the data transfer time. However, 
where such estimates are available, we would expect DARB to perform better, 
since the Gridbus Broker would not handle the situation where a slower node 
might be a better option due to much faster data communication times. 

The performance of DARB has so far only been tested using a small number of 
experimental scenarios on a small grid testbed using a simulated program. In fu- 
ture work, further experimentation using DataSim will investigate a wider range 
of input file sizes and computation/input ratios, different numbers of data files, 
multiple input files, and a variety of distributions and replications of data files. 
DARB will also be evaluated on a larger and more heterogeneous grid testbed, 
and using a variety of real data grid applications. 

Currently, DARB assumes that output files remain where they are computed. 
However, outputs are often needed elsewhere, such as the client, or a data repos- 
itory. DARB can easily be extended to take into account transfer of output files, 
but only if the size of these files is known beforehand. This could be specified by 
the user, or estimated based on previous runs. 

Currently, the performance of DARB depends on the accuracy of 
the user’s estimate of computation time for different architectures and clock 
speeds. A better approach would be to keep a profile of past execution times for 
each node, which could easily be done in an environment for parameter sweep 
applications, such as Nimrod/G or Gridbus. 



Abstract. Recent work towards a standards-based, service-oriented Grid 
represents the convergence of distributed computing technologies from the 
e-Science and e-Commerce communities. A global service management system 
is required in this context to achieve efficient resource sharing and 
collaborative service provisioning. This paper presents the VEGA system software, 
a novel grid platform that combines experiences from operating system design 
and current IP networks. VEGA is distinguished by its two design principles: a) 
ease of use. Through service virtualization, VEGA gives users a 
location-independent way to access services, so that grid applications can 
transparently benefit from systematic load balancing and error recovery; b) ease 
of deployment. The architecture of VEGA is fully decentralized without 
sacrificing efficiency, thanks to the mechanism of resource routing. The 
concept of grid adapters is developed to make joining and accessing the Grid a 
‘plug-and-play’ process. VEGA has been implemented on Web Service 
platforms and all its components are OGSI-compliant services. To evaluate the 
overhead and performance of VEGA, we conducted experiments with two 
categories of benchmark sets in a real-size Grid environment. The results 
show that VEGA provides an efficient service discovery and selection 
framework with a reasonable overhead. 



1 Introduction 

The Grid has been demonstrated as an efficient approach for providing computational 
services for various scientific applications [2]. Recently, common demands from the 
e-Commerce and e-Science communities have forced the convergence of technologies 
between these two fields, as seen in the move towards service-oriented Grids 
exemplified by the Open Grid Services Architecture [6]. 

In a service-oriented architecture (SOA)[7], a management system in charge of 
service provisioning plays a crucial role in mediating provider-consumer interactions. 
Challenges for managing a service Grid originate from the nature of the Grid: the 
diversity of resources and the dynamic behavior of resources. 

As the Grid is an open society of resources and users from different application 
domains, it is not feasible to predefine a single set of criteria for classifying 
all resources and user requirements. Service publishing and matching protocols 
could be ad hoc or domain specific, making global service discovery and 
scheduling more difficult to implement. Resources are often autonomous, which 
makes their behavior volatile. For example, a resource may join or leave the grid 
at any time without notifying the management system. This makes managing and 
searching for available resources increasingly complex. 

The objective of this paper is to present VEGA [5] (an acronym for Versatile services, 
Enabling intelligence, Global uniformity, and Autonomous control) as a novel 
approach to deal with these problems. VEGA aims at developing key techniques that 
are essential for building grid platforms and applications. VEGA has been applied in 
building a nationwide grid testbed called Chinese National Grid (CNGrid), which 
aims to bring together high performance computers in China to provide a virtual 
supercomputing environment. 

The rest of the paper is organized as follows. Section 2 introduces related 
work. The design principles of VEGA are described in Section 3. Section 4 
describes the architecture. The evaluation of the Grid router is given in Section 5. 
Section 6 presents the evaluation of VEGA. Section 7 concludes the paper. 



2 Related Works 

A number of research efforts target the management of services for dynamic, 
heterogeneous computing environments. In the web service architecture, 
meta-information about services is maintained by a centralized repository such as 
UDDI. Services published to UDDI are location-dependent, and dynamic information 
cannot be reflected. 

MDS in Globus has realized a globally uniform naming of distributed resources. In 
the MDS architecture, information is organized in a strict tree-like topology. The 
directory service used in MDS is LDAP, which is designed for reading rather than 
writing. In grid environments, however, resources may change frequently over time, 
which could result in a write bottleneck. 

In ICENI, ontology information is annotated with the description of service 
interfaces, which facilitates the automatic matching and orchestration of services at a 
semantic level. It leverages Jini technology for dynamic service discovery and 
publishing with the help of a registry service similar to UDDI. 

P2P networks rely on routing techniques for locating resources. In early systems, 
message flooding is used, in which queries are propagated along a dynamically 
changing path. Unfortunately, there is no guarantee that two identical queries 
return the same result set. In later semi-structured systems like CAN, a distributed 
hash table (DHT) algorithm is used for locating a resource within a limited number 
of hops. 



3 Design Principles of VEGA System Software 



3.1 Virtualization and Ease of Use 

From the users’ point of view, resource management should be completely 
transparent. It is the responsibility of the management system to translate abstract 
resource requirements into a set of actual resources. 

To make services fully independent of physical resources in VEGA, we adopt the 
concepts of virtual services and physical services in the Grid Service Space 
(GSS) model. In this model, we present the Virtual Service Space (VSS) and the 
Physical Service Space (PSS), with coessential mapping, scheduling mapping and 
translating mapping. 

In the GSS model, a virtual service is an abstract representation of one or many 
physical services that have common interfaces. A virtual service can be mapped 
to only one physical service at any point in time. This mapping process is called resource 
scheduling or resource binding. A programmer can refer to a virtual service by a 
location-transparent name and hence the application can obtain several benefits such 
as load balancing (by choosing alternative physical services with lower load), fault 
tolerance (by switching to a new physical service in response to service failure), 
locality of service access (by locating a nearer physical service), etc. 
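
The binding of a virtual service to one physical service at a time could be sketched as below. The class, its fields, and the least-load selection rule are assumptions chosen to illustrate load balancing and fault tolerance; they do not reproduce VEGA's actual scheduling mapping.

```python
class VirtualService:
    """A location-transparent name bound to one physical service at a time."""

    def __init__(self, name, physical_services):
        self.name = name
        self.candidates = dict(physical_services)  # endpoint -> current load
        self.bound = None

    def bind(self, failed=()):
        """Bind to the least-loaded physical service that has not failed."""
        alive = {ep: load for ep, load in self.candidates.items() if ep not in failed}
        if not alive:
            raise RuntimeError("no physical service available for " + self.name)
        self.bound = min(alive, key=alive.get)
        return self.bound


# Hypothetical endpoints and loads, for illustration only.
service = VirtualService("analysis", {"hostA": 0.9, "hostB": 0.2})
```

Calling `service.bind()` picks the lightly loaded endpoint, and calling `service.bind(failed={"hostB"})` after a failure rebinds the same name to another physical service without the application changing the name it uses.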



3.2 Decentralization and Ease of Deployment 

A centralized resource management system is conceptually able to produce very 
efficient schedules. For the Grid, however, it is not practicable to build such a 
centralized entity, as resources are owned by various administrative organizations. 
In contrast, a decentralized structure composed of many management points scales 
well with the increasing size of the Grid [1][3]. 

There are two fundamental steps to accomplish resource joining: 1) deploy the 
service onto a hosting environment and start it; 2) register the resource’s meta 
information with the information service so as to make it open to grid users. Normally, 
node administrators perform this registration manually. However, the Grid may 
accommodate a huge number of resources that vary over time. Manually handling 
the joining and leaving of those resources could be impossible. 

Inspired by IP networks, we designed counterparts of the routers and network 
adapters of a network system, which facilitate automating the discovery, publishing 
and deployment of resources in a grid environment [4]. In [4], we propose a 
Routing-Transferring resource discovery model, which includes three basic roles: the 
resource requester, the resource router and the resource provider. The provider sends 
its resource information to a router, which maintains this information in routing 
tables. When a router receives a resource request from a requester, it checks the 
routing tables to choose a route for the request and transfers it to another router or 
to a provider. 
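
The three roles can be sketched in a few lines. This is a toy model under stated assumptions: the class names, the `advertise` and `handle` methods, and the one-next-hop-per-resource routing table are hypothetical and are not taken from [4].

```python
from typing import Dict, List, Optional, Union


class Provider:
    """Owns resources and answers requests that reach it."""

    def __init__(self, name: str, resources: List[str]):
        self.name = name
        self.resources = set(resources)

    def handle(self, resource: str, path: List[str]) -> Optional[List[str]]:
        path.append(self.name)
        return path if resource in self.resources else None


class Router:
    """Keeps a routing table and transfers requests to the next hop."""

    def __init__(self, name: str):
        self.name = name
        # resource name -> next hop (another router or a provider)
        self.table: Dict[str, Union["Router", Provider]] = {}

    def advertise(self, resource: str, next_hop: Union["Router", Provider]) -> None:
        """Record where requests for a resource should be sent."""
        self.table[resource] = next_hop

    def handle(self, resource: str, path: List[str]) -> Optional[List[str]]:
        path.append(self.name)
        next_hop = self.table.get(resource)
        if next_hop is None:
            return None  # no route known for this resource
        return next_hop.handle(resource, path)


p1 = Provider("P1", ["r"])
r0, r1, r2 = Router("R0"), Router("R1"), Router("R2")
r2.advertise("r", p1)   # the provider registers with its nearby router
r1.advertise("r", r2)   # routing information propagates towards R0
r0.advertise("r", r1)

route = r0.handle("r", [])        # path taken by a request for resource r
miss = r0.handle("unknown", [])   # no routing entry, so no provider is found
```

Here the requester only contacts `R0`; the hop-by-hop transfer mirrors how an IP packet is forwarded without the sender knowing the full route.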



4 Architecture 



4.1 Layered Logical Overview 

Scalability and ease of deployment are the two driving forces of our architecture 
design. We propose three key components in VEGA: Grid Operating Systems (GOS), 
Grid Routers (GR) and Grid Adapters (GA). Their responsibilities are similar to those 
of their counterparts in present network systems: operating systems, IP routers and 
network adapters, respectively. 

VEGA conforms to the OGSA framework. Currently, VEGA is implemented on top 
of Globus Toolkit 3, which is used to encapsulate the physical services and 
physical resources. A Grid application accesses the virtual services that the Grid 
provides through the Vega Grid operating system (GOS), which calls the Vega Grid 
router to discover the Grid or web services. Figure 4.1 shows the layered architecture 
of VEGA. 




Figure 4.1. The layered architecture of VEGA. 



At the resource layer, the distributed resources are encapsulated as grid services or 
web services hosted by various application servers such as the GT3 container and .NET. 
Like a network adapter, a GA is the enabling software that connects a node machine to 
the Grid, making services hosted by the node machine known to the nearby GR. 

The system layer comprises two components. The interconnected GRs 
constitute an overlay network that provides the underlying information service for 
resource mapping. GRs are capable of routing resource requests to appropriate 
matching nodes. The GOS aggregates encapsulated physical services and 
provides a virtual view for application developers. 

At the application layer, developers can use the APIs, utilities and developing 
environments provided by Vega GOS to build virtual service based applications. 

GOS is the kernel of the entire VEGA architecture. It comprises a job broker 
service, a resource management service, a file service, a system monitor service, a 
GSML server, and a user & CA virtual interface service. 



4.2 Runtime Architecture 

Figure 4.2 illustrates the runtime architecture of VEGA and the information flows 
when deploying, locating and using grid resources. Two categories of grid nodes are 
shown in the figure, one equipped with a GA (left outline rectangle) and the other 
without (right outline rectangle). 

A complete service access process from the Grid client side consists of the 
following steps. Grid clients submit abstract resource requests to the GOS, and the 
GOS forwards the requests to any one GR. The response data contain a candidate set 
of physical services. Using a resource selection algorithm, the GOS selects one 
appropriate service from the candidates and delivers subsequent invocations from the 
client to it. The invocation requests and responses are passed through the GOS, 
except for large bulk data transfers or other situations where performance 
becomes critical. 

Managing Service-Oriented Grids: Experiences from VEGA System Software 




5 Evaluation of Grid Routers 

Grid routers are the backbone of the VEGA system; they act as transfer stations for 
resource requests. We propose a Routing-Transferring resource discovery model; more 
details can be found in Section 3.2. Figure 5.1 illustrates how a request mr is 
transferred from router R0 to R2, and how eventually U0 finds the resource r, which is 
located on the provider P1. 



[Figure legend: Requester, Provider, Router] 

Figure 5.1 The routing-transferring model and the process of locating resource 

In [9], based on the router architecture, we propose a three-level, fully decentralized 
and dynamic VEGA Infrastructure for Resource Discovery (VIRD), shown in Figure 
5.2. The top level is a backbone consisting of Border Grid Resource Name Servers 
(BGRNS); the second level is made up of several domains, each of which consists 
of Grid Resource Name Servers (GRNS); and the third level is the leaf layer, which 
includes all clients and resource providers. A client can query its designated GRNS in 
two ways, recursive or iterative, and the server answers the request using its 
knowledge about the grid. 
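
The difference between the two query modes can be illustrated with a toy name server. This is a simplified sketch, not the VIRD protocol: the `NameServer` class, its single `referral` link and the server and provider names are assumptions made for the example, and the link-state propagation is not modelled.

```python
from typing import Dict, Optional, Tuple


class NameServer:
    """A toy name server: some local records plus one server to refer to."""

    def __init__(self, name: str, records: Dict[str, str],
                 referral: Optional["NameServer"] = None):
        self.name = name
        self.records = records
        self.referral = referral

    def query_recursive(self, resource: str) -> Optional[str]:
        """Recursive mode: the server chases referrals on the client's behalf."""
        if resource in self.records:
            return self.records[resource]
        if self.referral is None:
            return None
        return self.referral.query_recursive(resource)

    def query_iterative(self, resource: str) -> Tuple[Optional[str], Optional["NameServer"]]:
        """Iterative mode: answer from local knowledge or hand back a referral."""
        if resource in self.records:
            return self.records[resource], None
        return None, self.referral


def resolve_iteratively(server: Optional[NameServer], resource: str):
    """Client-side loop: follow referrals until an answer or a dead end."""
    hops = 0
    while server is not None:
        answer, server = server.query_iterative(resource)
        hops += 1
        if answer is not None:
            return answer, hops
    return None, hops


border = NameServer("BGRNS-1", {"blast": "provider-7"})
domain = NameServer("GRNS-A", {"echo": "provider-2"}, referral=border)

recursive_answer = domain.query_recursive("blast")
iterative_answer = resolve_iteratively(domain, "blast")
not_found = resolve_iteratively(domain, "missing")
```

In recursive mode the client sends one request and the servers do the work; in iterative mode the client contacts each server itself, which is why the second call also reports the number of servers visited.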

We present a resource naming scheme and a link-state-like algorithm for resource 
information propagation. The VIRD architecture allows every server to have its own 
data organization, searching and choosing policies. 

The analysis shows that the resource discovery time depends on the topology and 
the distribution of resources. When the topology is fixed, the performance is 
determined by resource frequency and location. The results show that high frequency 
and even distribution can greatly reduce the resource discovery time. 

Figure 5.2 The three-level VIRD architecture: a backbone, domains and leaves. 

Figure 5.3 Graph of different resource locations and frequency 



6 Evaluation of System Overhead and Throughput 

In this section, we report preliminary results regarding the system overhead and 
throughput of VEGA. The overhead derives from three sources: a) the XML-based 
SOAP protocol; b) the key components of VEGA are secure GT3 services, which 
incur additional overhead; c) VEGA is written in Java, and the overall performance is 
sometimes unstable due to JVM mechanisms such as automatic garbage collection. 

The experiments were run over a non-dedicated network of PC servers in the 
Institute of Computing Technology (ICT) and the National Research Center for 
Intelligent Computing Systems (NCIC) of China. For simplicity, we deployed one 
GOS at ICT and two resource routers, at ICT and NCIC. The resource routers connect 
dozens of nodes, mainly PC servers with dual 1500 MHz AMD Athlon processors. 

We use two categories of applications as the benchmark set. One is a simple echo 
application, whose execution time on a single machine is so small that it can be used 
to measure the overall system overhead. The other is the well-known biology software 
BLAST, which is used to measure system throughput. 

We divide a job execution into three phases: initialization, execution and 
ending. The initialization phase starts when the job has just been submitted; it ends 
when the job’s status becomes submitted. The initialization time represents the 
overhead of the middle management system. The execution phase is the real execution 
time of the job. In the ending phase, temporary data or garbage is collected. 
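
The three phases can be derived from four timestamps per job. The sketch below is only an illustration of that bookkeeping: the `JobTimeline` class, its field names and the sample values are hypothetical and are not taken from the VEGA measurement code.

```python
from dataclasses import dataclass


@dataclass
class JobTimeline:
    """Four timestamps, in seconds, that delimit the three phases of one job."""
    submitted_at: float       # job handed to the system; initialization begins
    execution_start: float    # initialization ends, execution begins
    execution_end: float      # execution ends, ending phase begins
    finished_at: float        # temporary data collected; job done

    def phases(self) -> dict:
        """Return the duration of each phase in seconds."""
        return {
            "initialization": self.execution_start - self.submitted_at,
            "execution": self.execution_end - self.execution_start,
            "ending": self.finished_at - self.execution_end,
        }


# Made-up numbers: a job that runs for 1 s after 2 s of initialization.
echo_job = JobTimeline(0.0, 2.0, 3.0, 3.5)
echo_phases = echo_job.phases()
```

For a job whose execution time is negligible, the initialization and ending durations are what remain, which is why an echo application is a convenient probe of system overhead.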

Figure 6.1 shows the effect of different scheduling strategies on the execution 
time of jobs and on the system overhead. A series of identical job requests is 
submitted to the GOS in order. Load-balancing scheduling achieves a shorter execution 
time than round robin. We observe that as the job index increases, the 
execution time decreases smoothly. The reason is that, since we repeatedly submit the 
same job, the caches in system components take effect, so less time is consumed. The 
figure also shows that the system overhead is relatively large, about 2 and 4 seconds, 
respectively. This is because the timer on the client side updates every 2 seconds. 



[Chart: Blast (1s), one job at a time] 

Figure 6.1 Execution time of Blast with 1 second execution time and one job at a time. 

Figure 6.2 illustrates the initialization and ending times for the two scheduling 
strategies, load balance and Round-Robin. The experiments show that the 
initialization and ending performance of Round-Robin is much better than that of load 
balance. The reason is that load balance has to poll all the possible sites for online 
load information. This suggests that Round-Robin outperforms load balance in 
scalability. Another noticeable point is that Round-Robin is much more stable than 
load balance. The reason is that the overhead of polling a site for online load 
information varies dynamically over time with the system and network conditions. 
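
The cost difference between the two strategies comes from how much information each decision needs. The following sketch makes that visible; the selector functions, site names and load values are assumptions for illustration and do not reproduce the GOS scheduler.

```python
from itertools import cycle
from typing import Callable, List


def round_robin_selector(sites: List[str]) -> Callable[[], str]:
    """Return a selector that needs no load information at all."""
    order = cycle(sites)
    return lambda: next(order)


def load_balance_selector(sites: List[str],
                          poll: Callable[[str], float]) -> Callable[[], str]:
    """Return a selector that polls every site before each decision."""
    def select() -> str:
        loads = {site: poll(site) for site in sites}  # one poll per site
        return min(loads, key=loads.get)
    return select


sites = ["s1", "s2", "s3"]
current_load = {"s1": 3.0, "s2": 1.0, "s3": 2.0}   # made-up load figures
poll_count = [0]


def poll(site: str) -> float:
    """Stand-in for a remote load query; counts how often it is called."""
    poll_count[0] += 1
    return current_load[site]


rr = round_robin_selector(sites)
rr_picks = [rr() for _ in range(4)]   # no polling needed

lb = load_balance_selector(sites, poll)
lb_pick = lb()                        # polls all three sites for one decision
```

Each load-balanced decision costs one remote query per site, so its overhead grows with the number of sites and inherits the variability of the network, whereas the round-robin decision is constant-time and local.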




Figure 6.2 Initialization time and ending time of different jobs 






Y.Z. Sun et al. 



Given the large number of computers in a Grid environment, we should apply load 
balancing algorithms that are as simple as possible, such as Round-Robin. However, 
Figure 6.3 illustrates that when the Grid is of small or medium scale, a job with a 
sufficiently long execution time may fare the same under load balance as under 
Round-Robin. Based on our experiments, we suppose that a randomly-selected-site 
algorithm may be as efficient and scalable as Round-Robin. In the future, 
we should test this kind of load balancing strategy. 



[Chart: Blast (1s), 100 jobs at one time; horizontal axis: Job Index] 

Figure 6.3 Run 100 jobs in parallel on a grid comprised of two servers. 



7 Summary 

In this paper, we have reviewed the challenging issues in grid service management, 
described the design principles of VEGA and introduced its architecture. Finally, we 
have given an evaluation of VEGA, including routers, system overhead and throughput. 

VEGA combines experiences from operating system design and current IP 
networks. It is distinguished by its two design principles: a) ease of use, providing 
users with a location-independent way to access services, b) ease of deployment. The 
architecture of VEGA is fully decentralized without sacrificing efficiency. 

Currently, we are focusing on developing a new version of the proof-of-concept 
prototype of the Vega Grid, based on our version 1.0. 



Abstract. Approaches to scheduling and load balancing on PC-based cluster 
systems are well known. Self-scheduling schemes, which are suitable for parallel 
loops with independent iterations on cluster computer systems, have been 
designed in the past. In this paper, we propose a new scheme that can adjust the 
scheduling parameter dynamically on extremely heterogeneous PC-based cluster 
and grid computing environments in order to improve system performance. A grid 
computing environment consisting of multiple PC-based clusters is constructed 
using Globus Toolkit and SUN Grid Engine middleware. The experimental results 
show that our scheduling can result in higher performance than other similar 
schemes. 

Keywords. Parallel loops, Self-scheduling, PC-based clusters, Grid Computing 



1. Introduction 

Parallel computers are increasingly widespread, and nowadays many of these parallel 
computers are no longer shared-memory multiprocessors, but follow the distributed 
memory model for reasons of scalability. Such systems often consist of homogeneous 
workstations, where all the workstations have processors, memory and cache 
memory with exactly identical specifications. Nowadays, more and more systems are 
composed of homogeneous workstations clustered together with a number of 
heterogeneous workstations, which may have similar or different architectures, speeds, 
and operating systems. For this reason, the first thing we have to do is to distinguish 
whether the target system is homogeneous or heterogeneous. Therefore, we define a 
relative frame of reference that classifies the cluster system into one of two typical 
cases: relatively homogeneous and relatively heterogeneous. 

Once the system architecture is clear, the next step is task analysis. As we know, 
loops are the major source of program parallelism. If the loop iterations 
can be distributed to different processors as evenly as possible, the parallelism within 
loop iterations can be exploited. Loops can be roughly divided into four kinds, as 
shown in Figure 1: uniform workload, increasing workload, decreasing workload, and 
random workload loops. They are the most common ones in programs, and should 
cover most cases. In a relatively homogeneous case, the workload can be partitioned 
among the working computers in proportion to their computing power, but in a 
relatively heterogeneous case this method will not work. The self-scheduling scheme 
works well not only in moderately heterogeneous cluster environments but also in 
extremely heterogeneous environments, where the performance difference between the 
fastest computer and the slowest computer is large. 



[Figure panels: 1. Uniform workload; 2. Increasing workload. Horizontal axis: Time] 

Figure 1. Four kinds of loop style 



In this paper, we revise known loop self-scheduling schemes to fit both 
homogeneous and heterogeneous PC cluster environments. The HINT Performance 
Analyzer [2] is used to help distinguish whether the target system is relatively 
homogeneous or relatively heterogeneous. Afterwards, we handle the four loop styles 
in different ways according to the typical case of the cluster system, in order to 
achieve good performance in any possible execution environment. In this paper, we 
propose a new scheme that can adjust the scheduling parameter dynamically on 
extremely heterogeneous PC-based cluster and grid computing environments in order 
to improve system performance. A grid computing environment consisting of multiple 
PC-based clusters is constructed using Globus Toolkit and SUN Grid Engine 
middleware. The experimental results show that our scheduling can result in higher 
performance than other similar schemes. 



2. Background 

2.1. Self-scheduling 

Self-scheduling is a large class of adaptive/dynamic centralized loop scheduling 
schemes. In a common self-scheduling scheme, p denotes the number of processors, 
N denotes the total number of iterations, and f() is a function that produces the 
chunk size at each step. At the i-th scheduling step, the master computes the chunk 
size $C_i$ and the remaining number of tasks $R_i$: 

$R_0 = N, \quad C_i = f(i, p), \quad R_i = R_{i-1} - C_i$ 

where f() possibly has more parameters than just i and p, such as $R_{i-1}$. The master 
assigns $C_i$ tasks to an idle slave, and the load imbalance will depend on the 
execution time gap between $t_j$, for $j = 1, \ldots, p$ [7]. 
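
The recurrence can be turned directly into a chunk generator. The sketch below is illustrative: the function names are hypothetical, and the two chunk rules shown are the textbook definitions of pure self-scheduling (one iteration per request) and guided self-scheduling (GSS, a chunk of ceil(R/p)), used here as examples of f().

```python
from typing import Callable, List


def self_schedule(n: int, p: int,
                  chunk_fn: Callable[[int, int, int], int]) -> List[int]:
    """Generate chunk sizes C_1, C_2, ... until no iterations remain.

    chunk_fn(i, p, remaining) plays the role of f(); its result is clamped to
    the range 1..remaining so that the loop always terminates.
    """
    chunks = []
    remaining = n                       # R_0 = N
    i = 1
    while remaining > 0:
        c = max(1, min(chunk_fn(i, p, remaining), remaining))   # C_i
        chunks.append(c)
        remaining -= c                  # R_i = R_{i-1} - C_i
        i += 1
    return chunks


def pure_ss(i: int, p: int, remaining: int) -> int:
    """Pure self-scheduling: one iteration per request."""
    return 1


def gss(i: int, p: int, remaining: int) -> int:
    """Guided self-scheduling: ceil(remaining / p)."""
    return -(-remaining // p)


ss_chunks = self_schedule(4, 2, pure_ss)
gss_chunks = self_schedule(2048, 5, gss)
```

With N = 2048 and p = 5, the GSS rule yields 30 chunks beginning 410, 328, 262, 210, 168, which agrees with the GSS row of Table 1 in Section 3.2. Larger early chunks reduce the number of scheduling steps, while the small final chunks limit the finishing-time gap between slaves.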



2.2. The α Self-scheduling Scheme 

In our previous scheduling paper [1], α% of the workload is partitioned among the 
nodes according to their performance, weighted by CPU clock, in the first phase, and 
the remaining (100-α)% of the workload is partitioned according to a known 
self-scheduling scheme in the second phase. The experiments were conducted on a PC 
cluster with six nodes, in which the fastest computer was 7.5 times faster than the 
slowest ones in CPU clock rate. Various α values were applied to matrix 
multiplication, and the best performance was obtained with α=15. Thus, our approach 
is suitable for all applications with regular parallel loops. Applying the α 
self-scheduling scheme to FSS, GSS, and TSS, we obtain three improved 
self-scheduling schemes, called NFSS, NGSS, and NTSS [1], where N means “new”. 



3. Methodology 

We have implemented dynamic adjustment of the scheduling parameters so that the 
system fits multiform system architectures. We then combined Grid computing 
technology, the HINT Performance Analyzer, our α self-scheduling scheme, and the 
dynamic adjustment of scheduling parameters into a whole new approach. 



3.1. System Definition 

System definition is the first step in our approach. The HINT Performance Analyzer 
[2] helps us to distinguish whether the target system is relatively homogeneous or 
relatively heterogeneous. We gather CPU performance capabilities, amounts of 
memory, cache sizes, and basic system performance using HINT. An updatable 
library, called the System Information Array (SIA), is built to record the collected 
information. The two Cluster System Typical Cases are defined as follows: 

Gather CPU information, $P_1, P_2, \ldots, P_n$. 

Assume $P_1$ is the node that has the worst performance (working ability) of all. 

Say, $P_n = r_n P_1$. 

Partition α% of the workload according to their performance weighted by CPU clock, 
and the remaining (100-α)% of the workload according to a known self-scheduling 
scheme. 
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(1) Define Heterogeneous Ratio (HR), « MinQUIPS « J_ < a 7 100, 

p MaxQUIPS p 

where α' is the temporary value of α. 

(2) Case 1: If α' < HR, then we say the target system is the relatively heterogeneous 
case. 

Case 2: If α' > HR, then we say the target system is the relatively homogeneous 
case. 

(3) If the target system is relatively heterogeneous, we start the α 
self-scheduling scheme with α = α'%. 

If the target system is relatively homogeneous, then we run the HINT 
benchmark to build (and update) the SIA, and start the α self-scheduling 
scheme with α = 100%. 

One point deserves attention: the SIA is not updated before every job submission. 
Only when one or more new nodes have been added to the system is an SIA update 
needed, and α is then adjusted accordingly. 
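
The case analysis above amounts to a small decision procedure. The sketch below takes HR and the temporary value α' as given inputs; the function names are hypothetical, and treating the boundary α' = HR as the homogeneous case is an assumption, since the text only covers the strict inequalities.

```python
from typing import Set, Tuple


def choose_alpha(alpha_prime: float, hr: float) -> Tuple[str, float]:
    """Classify the target system and return the alpha (in percent) to start with."""
    if alpha_prime < hr:
        # Case 1: relatively heterogeneous, start with alpha = alpha'
        return "relatively heterogeneous", alpha_prime
    # Case 2 (alpha' > HR); the equality case is folded in here as an assumption
    return "relatively homogeneous", 100.0


def needs_sia_update(recorded_nodes: Set[str], current_nodes: Set[str]) -> bool:
    """The SIA is refreshed only when nodes have been added since the last record."""
    return bool(current_nodes - recorded_nodes)


hetero = choose_alpha(alpha_prime=30.0, hr=50.0)
homo = choose_alpha(alpha_prime=80.0, hr=50.0)
grown = needs_sia_update({"n1", "n2"}, {"n1", "n2", "n3"})
unchanged = needs_sia_update({"n1", "n2"}, {"n1", "n2"})
```

Skipping the benchmark when the node set is unchanged keeps the classification step off the critical path of ordinary job submissions.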



3.2. Loop Styles Analysis 

For programs with regular loops, intuitively, we may want to partition the problem 
size according to CPU clock in a heterogeneous environment. However, the CPU 
clock is not the only factor that affects computer performance. Many other factors 
also have a dramatic influence in this respect, such as the amount of memory available, 
the cost of memory accesses, and the communication medium between processors 
[5]. With this intuitive approach, the result will be degraded if the performance 
prediction is inaccurate. The computer with the most inaccurate prediction will be the 
last one to finish its assigned job. 

Loops can be roughly divided into four kinds, as shown in Figure 1: uniform 
workload, increasing workload, decreasing workload, and random workload loops. 
They are the most common ones in programs, and should cover most cases. These 
four kinds can be classified into two types: regular and irregular. The first kind is 
regular and the last three are irregular. Different loops may need to be handled in 
different ways in order to get the best performance. Since the workload is predictable 
in regular loops, it is not necessary to perform load balancing at the beginning. 

We propose to partition the problem size in two stages. In the first stage, α% of the 
total workload is partitioned according to performance weighted by CPU clock. In 
this way, the communication between master and slaves can be reduced efficiently. In 
the second stage, the remaining (100-α)% of the total workload is partitioned 
according to a known self-scheduling scheme. In this way, load balancing can be 
achieved. This approach is suitable for all regular loops. An appropriate α value will 
lead to good performance. 

Furthermore, a dynamic load balancing approach should not need to be aware of the 
run-time behavior of the applications before execution. But in GSS and TSS, to 
achieve good performance in an extremely heterogeneous environment, the computers 
in the cluster have to be ordered by performance, which is not very practical. With our 
schemes, this problem does not arise. In this paper, the term “FSS-80” stands for 
“α=80, and the remaining iterations are partitioned using FSS”, and so on. 

Example 1 

Suppose that there is a cluster consisting of five slaves. The computing nodes have 
CPU clocks of 200MHz, 200MHz, 233MHz, 533MHz, and 1.5GHz, respectively. 
Table 1 shows the different chunk sizes for a problem with the number of iterations 
I=2048 in this cluster. The number of scheduling steps is given in parentheses. 



Table 1. Sample partition size of Example 1 

GSS: 410, 328, 262, 210, 168, 134, 108, 86, 69, 55, 44, 35, 28, 23, 18, 14, 12, 9, 7, 
6, 5, 4, 3, 2, 2, 2, 1, 1, 1, 1 (N=30) 

GSS-80: 923, 328, 144, 123, 121, 82, 66, 53, 42, 34, 27, 21, 17, 14, 11, 9, 7, 6, 4, 4, 
3, 2, 2, 1, 1, 1, 1, 1 (N=28) 

FSS: 205, 205, 205, 205, 205, 103, 103, 103, 103, 103, 51, 51, 51, 51, 51, 26, 26, 26, 
26, 26, 13, 13, 13, 13, 13, 6, 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 1 (N=43) 

FSS-80: 923, 328, 144, 123, 121, 41, 41, 41, 41, 41, 21, 21, 21, 21, 21, 10, 10, 10, 
10, 10, 5, 5, 5, 5, 5, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1 (N=39) 

TSS: 204, 194, 184, 174, 164, 154, 144, 134, 124, 114, 104, 94, 84, 74, 64, 38 (N=16) 

TSS-80: 923, 328, 144, 123, 121, 40, 38, 36, 34, 32, 30, 28, 26, 24, 22, 20, 18, 16, 
14, 12, 10, 8, 1 (N=23) 
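
The FSS-80 row can be reproduced with a short script. This is a sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the first-phase rounding rule (round each share up, serve the fastest node first, give the slowest node the remainder) is inferred because it matches the row values, and the second phase uses the usual factoring rule of p chunks of size ceil(R/(2p)) per batch.

```python
from typing import List


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def weighted_first_phase(n: int, alpha: int, clocks_mhz: List[int]) -> List[int]:
    """Split ceil(alpha% of n) iterations in proportion to CPU clock.

    Assumed rounding rule: nodes are served fastest first, every share is
    rounded up, and the slowest node takes whatever is left.
    """
    budget = ceil_div(alpha * n, 100)
    total_clock = sum(clocks_mhz)
    ordered = sorted(clocks_mhz, reverse=True)
    shares: List[int] = []
    for clock in ordered[:-1]:
        share = min(ceil_div(budget * clock, total_clock), budget - sum(shares))
        shares.append(share)
    shares.append(budget - sum(shares))
    return shares


def fss(n: int, p: int) -> List[int]:
    """Factoring: each batch hands out p chunks of size ceil(remaining / (2p))."""
    chunks: List[int] = []
    remaining = n
    while remaining > 0:
        size = ceil_div(remaining, 2 * p)
        for _ in range(p):
            if remaining == 0:
                break
            c = min(size, remaining)
            chunks.append(c)
            remaining -= c
    return chunks


def alpha_fss(n: int, alpha: int, clocks_mhz: List[int]) -> List[int]:
    """Two-stage partition: weighted first phase, then FSS on the rest."""
    first = weighted_first_phase(n, alpha, clocks_mhz)
    return first + fss(n - sum(first), len(clocks_mhz))


clocks = [200, 200, 233, 533, 1500]      # CPU clocks of Example 1, in MHz
fss_plain = fss(2048, 5)
fss_80 = alpha_fss(2048, 80, clocks)
```

The first five entries of `fss_80` are the one-off weighted shares, so each slave needs only one request to the master for 80% of the work; only the remaining 409 iterations go through the finer-grained FSS steps.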



To model our approach, we use the following terminology: 

• T is the total workload of all iterations in a loop. 

• W is the α% of the total workload. 

• b is the smallest workload in an increasing/decreasing workload loop. It can be the 
workload of the first iteration (in an increasing workload loop) or the workload 
of the last iteration (in a decreasing workload loop). 

• h is the difference in workload between consecutive iterations; h is a positive 
integer. 

• x is the iteration number at which the α% accumulated workload is reached; x 
is a positive real number. 
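
As a worked illustration of these quantities, consider an increasing workload loop in which the workload grows by a constant difference from one iteration to the next, so that the accumulated workload after x iterations is x times the smallest workload plus the difference times x(x-1)/2. The sketch below solves this for the iteration count at which α% of the total is reached. The function names and the symbols `b`, `h`, `n` and `x` are labels chosen for this example, and the closed form is derived from the arithmetic-progression assumption, not quoted from the paper.

```python
import math


def total_workload(b: float, h: float, n: int) -> float:
    """Sum of n workloads b, b + h, b + 2h, ... (an arithmetic progression)."""
    return n * b + h * n * (n - 1) / 2


def split_iteration(b: float, h: float, n: int, alpha: float) -> float:
    """Real-valued iteration count x at which alpha% of the total has accumulated."""
    target = alpha / 100 * total_workload(b, h, n)
    if h == 0:
        return target / b   # uniform workload: plain proportion
    # Solve (h/2) x^2 + (b - h/2) x - target = 0 for the positive root.
    q = b - h / 2
    return (-q + math.sqrt(q * q + 2 * h * target)) / h


# 100 iterations with workloads 1, 2, ..., 100 and alpha = 50
x_half = split_iteration(b=1, h=1, n=100, alpha=50)
# Uniform loop: 80% of the work is simply 80% of the iterations
x_uniform = split_iteration(b=2, h=0, n=100, alpha=80)
```

For the increasing loop, half of the work is reached only between iterations 70 and 71, not at iteration 50, which is why the split point has to be computed from the workload rather than from the iteration count.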



3.3. System Modeling 

In our new parallel loop self-scheduling scheme, the HINT Performance Analyzer 
helps us to classify the cluster system into one of the two typical cases. Next, we 
must react properly and apply an appropriate self-scheduling scheme, since both the 
system architecture and the loop style may vary. Parallel loop style analysis 
is essential, since parallel loops can be roughly divided into four kinds, as shown in 
Figure 1: uniform workload, increasing workload, decreasing workload, and random 
workload loops. They should be the most common ones in programs, and should 
cover most cases. Moreover, we implement dynamic adjustment of the scheduling 
parameters to fit multiform system architectures, and use Message Passing Interface 
(MPI) directives to parallelize the code segments to be executed by the multiple CPUs 
of the cluster. In the loop parallelism region, our self-scheduling scheme must be 
inserted by hand into the source code, in the region containing the largest possible 
loops that may be parallelized. An example of how our new self-scheduling scheme 
works is shown in Figure 2. 

An Efficient Parallel Loop Self-scheduling on Grid Environments 




[Figure 2, recovered fragments of the program outline: /* Do serial work */ ... /* finalize program */] 

Figure 2. System model. 
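
The master/slave interaction behind such a model can be explored without MPI by simulating it in a single process. The sketch below is a stand-in, not the authors' program: the function name, the relative-speed model (time = chunk size divided by speed) and the sample numbers are assumptions, and communication cost is ignored.

```python
import heapq
from typing import List


def simulate(chunks: List[int], speeds: List[float]) -> List[float]:
    """Hand each chunk to whichever slave becomes idle first.

    Returns the time at which each slave finishes its last chunk.
    """
    idle = [(0.0, i) for i in range(len(speeds))]   # (time idle, slave index)
    heapq.heapify(idle)
    finish = [0.0] * len(speeds)
    for chunk in chunks:
        t, i = heapq.heappop(idle)          # earliest idle slave asks for work
        done = t + chunk / speeds[i]        # master assigns the chunk
        finish[i] = done
        heapq.heappush(idle, (done, i))
    return finish


speeds = [1.0, 4.0]                          # one slow and one fast slave
static_split = simulate([50, 50], speeds)    # equal halves, decided up front
dynamic_split = simulate([10] * 10, speeds)  # small chunks, handed out on demand
```

With equal halves the slow slave finishes at time 50 while the fast one idles from 12.5 onwards; with on-demand chunks both finish at time 20. That finishing-time gap is the load imbalance that self-scheduling is designed to shrink.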



4. Experimental Results 



4.1. Hardware and Software Configuration 

Our Grid architecture, named grid-cluster, is implemented on top of Globus Toolkit. 
It combines three PC clusters to form a computational grid environment (Figure 3). 

• Alpha site: Four PCs, each with two AMD Athlon MP2000 processors, 
512MB DDRAM and an Intel PRO100VE NIC. 

• Beta site: Four PCs, each with one Intel Celeron 1.7GHz processor, 256MB 
DDRAM, and a 3Com 3c9051 NIC. 

• Gamma site: Four PCs, each with two Intel P3 866 MHz processors, 256MB 
SDRAM and a 3Com 3c9051 NIC. 

The SGE QMaster daemon runs on the master node of each PC cluster, together with 
an SGE execute daemon, to manage and monitor incoming jobs, and Globus Toolkit 
v2.4. Each slave node runs only the SGE execute daemon to execute incoming jobs. 
The operating system is RedHat Linux release 9. Parallel applications use MPICH-G2 
v1.2.5 for message passing. 




4.2. Experimental Results 
4.2.1. Regular Workload 

The experiment consists of three scenarios: (1) a performance comparison of the 
scheduling schemes under uniform workload; (2) different grid environments; and 
(3) matrix multiplication with different matrix sizes. In the first step, we run an MPI 
program on the different grid systems to evaluate system performance. In the second 
step, we connect these grid systems together to form a grid environment (in our 
testbed, grids Alpha, Beta and Gamma) and run the same MPI program to 
evaluate system performance. In the third step, across the different system topologies, 
we bring the system characteristics together for a performance analysis. Finally, we 
run the same MPI program to evaluate the system performance of the different system 
architectures. Whatever parallel loop scheduling situation arises, our new scheme 
arranges it properly and achieves better performance than the schemes developed 
before. The performance analyses are presented in Figures 4, 5, and 6. 

Figures 4, 5, and 6 show the results when our approach connects these grid systems 
together to form a grid environment (in our testbed, grids Alpha, Beta and Gamma), 
runs the same MPI program to evaluate system performance, and implements the 
FSS, GSS, and TSS groups of approaches. The previous methods, NFSS, NTSS, and 
NGSS, perform worse than the new scheme with dynamic parameterization and 
automatic systematic adjustment. 

Figure 4. A chart of execution time of different sizes of matrix multiplication by grid α + β + γ. 

Figure 5. A chart of execution time of different sizes of matrix multiplication by grid β. 

Figure 6. A chart of execution time of different sizes of matrix multiplication by grid β + γ. 



4.2.2. Irregular Workload 

The experiment consists of three scenarios, comparing the performance of the 
scheduling schemes under (1) increasing workload, (2) decreasing workload, and 
(3) random workload. Figures 7, 8, and 9 show the execution time of the simulated 
increasing, random, and decreasing workload loops under various self-scheduling 
approaches on grid α + β + γ. 




Figure 7. A chart of execution time of simulated increasing workload loop by various 
self-scheduling approaches on grid α + β + γ. 

Figure 8. A chart of execution time of simulated random workload loop by various 
self-scheduling approaches on grid α + β + γ. 

Figure 9. A chart of execution time of simulated decreasing workload loop by various 
self-scheduling approaches on grid α + β + γ. 



5. Conclusion and Future Work 

In this paper, we find that Grid computing technology can deliver more 
computing performance than a traditional PC cluster or SMP system. Moreover, we 
have tried to design and integrate a complete system for parallel loop 
self-scheduling. Whatever parallel loop scheduling situation arises, the system 
arranges it properly and achieves better performance than the schemes developed 
before. We revise known loop self-scheduling schemes to fit both homogeneous and 
heterogeneous PC clusters and Grid environments, whether the loop style is regular or 
irregular. As feedback information is investigated, collected, and analyzed, the 
performance improves with each round of feedback collection and job submission. We 
have successfully combined Grid computing technology, the HINT Performance 
Analyzer, our α self-scheduling scheme, and the dynamic adjustment of scheduling 
parameters into a whole new approach. The goal of achieving good performance in 
parallel loop self-scheduling with our approach is practicable. Our future work is to 
find an appropriate method to investigate the performance trend after new computing 
nodes are added, and a proper way to adjust the value of α. 



Abstract. This paper discusses a Java-based grid computing middleware, 
ALiCE, to facilitate the development and deployment of generic grid 
applications on heterogeneous shared computing resources. The ALiCE 
layered grid architecture comprises a core layer that provides the basic 
services for control and communication within a grid. A programming template in 
the extensions layer provides a distributed shared-memory programming 
abstraction that frees the grid application developer from the intricacies of the 
core layer and the underlying grid system. The performance of a distributed Data 
Encryption Standard (DES) key search problem on two grid configurations is 
discussed. 



1 Introduction 

Grid computing [4, 8] is an emerging technology that enables the utilization of shared 
resources distributed across multiple administrative domains, thereby providing 
dependable, consistent, pervasive, and inexpensive access to high-end computational 
capabilities [5] in a collaborative environment. Grids can be used to provide 
computational, data, application, and information services, and consequently 
knowledge services, to end users, which can be either humans or processes. 

Grid computing projects can be hierarchically categorized as integrated grid 
systems, application(s)-driven efforts and middleware [1]. NetSolve [3] is one 
example of an integrated grid system. It is a client/server application designed to 
solve computational science problems in a wide-area distributed environment. A 
NetSolve client communicates, using Matlab or the Web, with the server, which can 
adopt any scientific package in the computational kernel. The European DataGrid 
[12] is a highly distinguished instance of an application-driven grid effort. Its 
objective is to develop a grid dedicated to the analysis of large volumes of data 
obtained from scientific experiments, and to establish productive collaborations 
between scientific groups based in different geographical locations. Middleware 
developed for grid computing includes Globus [6] and Legion [11]. The Globus 
metacomputing toolkit attempts to facilitate the construction of computational grids 
by providing a metacomputing abstract machine: a set of loosely coupled basic 
services that can be used to implement higher-level components and applications. 
Globus is realigning its toolkit with the emerging OGSA grid standard [7]. Legion is 
a metacomputing toolkit that treats all hardware and software components in the grid 
as objects that are able to communicate with each other through method invocations. 
Like Globus, Legion pledges to provide users with the vision of a single virtual 
machine. 

This paper presents ALiCE (Adaptive and scalable Internet-based Computing 
Engine), a grid computing core middleware designed for secure, reliable and efficient 
execution of distributed applications on any Java-compatible platform. Our main 
design goal is to provide developers of grid applications with a user-friendly 
programming environment that does away with the hassle of implementing the grid 
infrastructure, thus enabling them to concentrate solely on their application problems. 
The middleware encapsulates services for compute and data grids, resource 
scheduling and allocation, and facilitates application development with a 
straightforward programming template [15, 16]. 

The remainder of this paper is structured as follows. Section 2 describes the design 
of ALiCE including its architecture and runtime system. Section 3 discusses the 
ALiCE template-based distributed shared-memory programming model. Section 4 
evaluates the performance of ALiCE using a key search problem. Our concluding 
remarks are in Section 5. 



2 System Design 

2.1 The Objective of ALiCE 

Several projects, such as Globus and Legion, attempt to provide users with the vision 
of a single abstract machine for computing by the provision of core/user-level 
middleware encapsulating fundamental services for inter-entity communications, task 
scheduling and management of resources. Likewise, ALiCE is a portable middleware 
designed for developing and deploying general-purpose grid applications and 
application programming models. However, unlike the Globus toolkit, which is a
collection of grid tools, ALiCE is a grid system.

ALiCE is designed to meet a number of goals. It achieves flexibility and
scalability by supporting the concurrent execution of multiple applications and
the presence of multiple clients within the grid. It enables grid applications to
be deployed on any Java-compatible operating system and hardware platform, because
it is implemented in the platform-independent Java language, unlike systems such as
Condor [9], which is C-based and executes only on WinNT and Unix platforms.
ALiCE also offers an API that provides generic runtime infrastructure support,
allowing the deployment of any distributed application. This is a major feature
that a middleware has to provide, and it distinguishes ALiCE from application-driven
efforts that are problem-specific, such as SETI@Home [14].

2.2 Architecture 

The ALiCE grid architecture, as shown in Figure 1, comprises three constituent
layers, ALiCE Core, ALiCE Extensions and ALiCE Applications and Toolkits, built
upon a set of Java technologies and operating on a grid fabric. The ALiCE system is
written in Java and implemented using Java technologies including Sun
Microsystems' Jini™ and JavaSpaces™ [13] for resource discovery services and
object communications within a grid. It also works with GigaSpaces™ [10], an
industrial implementation of JavaSpaces.



[Figure 1 shows four stacked layers, from top to bottom: ALiCE Applications &
Toolkits, ALiCE Extensions, ALiCE Core, Java Technologies.]

Figure 1: ALiCE Layered Grid Architecture

The ALiCE core layer encompasses the basic services used to develop grids. 
Compute Grid Services include algorithms for resource management, discovery and 
allocation, as well as the scheduling of compute tasks. Data Grid Services are 
responsible for the management of data accessed during computation, locating the 
target data within the grid and ensuring multiple copy updates where applicable. The 
security service is concerned with maintaining the confidentiality of information 
within each node and detecting malicious code. Object communication is performed 
via our Object Network Communication Architecture that coordinates the transfer of 
information-encapsulated objects within the grid. Besides these grid foundation 
services, a monitoring and accounting service is also included. 

The ALiCE extensions layer encompasses the ALiCE runtime support infrastructure
for application execution and provides the user with a distributed shared-memory
programming template for developing grid applications at an abstract level. Runtime
support modules are provided for different programming languages and machine
platforms. Advanced data services are also introduced to enable users to customize
the way in which their application handles data. This is especially useful in
problems that work on uniquely formatted data, such as data retrieved from
specialized databases and in the physical and life sciences. This is the layer that
application developers will work with.

The ALiCE applications and toolkits layer encompasses the various grid
applications and programming models that are developed using the ALiCE programming
template. It is the only layer visible to ALiCE application users.



2.3 Runtime System 

Figure 2 shows the ALiCE runtime system. It adopts a three-tiered architecture
comprising three main types of entities (consumer, producer and resource broker)
together with a task farm manager, as described in the following:




Figure 2: ALiCE Runtime System 



• Consumer. This submits applications to the ALiCE grid system. It can be any
machine within the grid running the ALiCE consumer/producer components. It is
responsible for collecting the results of the application currently running, as
returned by the tasks executed at the producers. It is also the point from which
new protocols and new runtime supports can be added to the grid system.

• Resource broker. This is the core of the grid system and deals with resource and
process management. It has a scheduler that performs both application and task
scheduling. Application scheduling helps to ensure that each ALiCE application is
able to complete execution in a reasonable turnaround time, and is not constrained
by the workload in the grid where multiple applications can execute concurrently.
Task scheduling coordinates the dissemination of compute tasks, thereby
controlling the utilization of the producers. The default task scheduling algorithm
adopted in ALiCE is eager scheduling [2].

• Producer. This is run on a machine that volunteers its cycles to run ALiCE
applications. It receives tasks from a resource broker in the form of serialized live
objects, dynamically loads the objects and executes the encapsulated tasks. The
result of each task is returned to the consumer that submitted the application. A
producer and a consumer can be run concurrently on the same machine.

• Task Farm Manager. ALiCE applications are initiated by the Task Farm Manager 
and the tasks generated are then scheduled by the resource broker and executed by 
the producers. The task farm manager is separated from the resource broker for 
two principal reasons. Firstly, ALiCE supports non-Java applications that are 
usually platform-dependent, and the resource broker may not be situated on a 
suitable platform to run the task generation codes of these applications. Secondly, 
for reasons of security and fault tolerance, the execution of alien code submitted by
consumers is isolated from the resource broker. 



3 Grid Programming 

ALiCE adopts the TaskGenerator-ResultCollector programming model. This model
comprises four main components: TaskGenerator, Task, Result and
ResultCollector. The consumer first submits the application to the grid system. The
TaskGenerator running at a task farm manager machine generates a pool of Tasks
belonging to the application. These Tasks are then scheduled for execution by the
resource broker, and the producers download the tasks from the task pool. The results
of the individual executions at the producers are returned to the resource broker as
Result objects. The ResultCollector, initiated at the consumer to support visualization
and monitoring of data, collects all Result objects from the resource broker.
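
The flow described above can be sketched in a few lines. ALiCE itself is written in
Java and its template classes are not reproduced in this paper, so the following
Python sketch only mirrors the four roles. The names `SquareTask`, `run_application`
and `collect_result` are hypothetical, and a local loop stands in for the resource
broker and the producers.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Result:
    """Generic result object; any serializable attributes could be added."""
    task_id: int
    value: int


@dataclass
class SquareTask:
    """Hypothetical task: the work that a producer would execute."""
    task_id: int
    operand: int

    def execute(self) -> Result:
        return Result(self.task_id, self.operand * self.operand)


class TaskGenerator:
    """Runs at the task farm manager and fills the task pool."""

    def process(self, operands: Iterable[int]) -> List[SquareTask]:
        return [SquareTask(i, x) for i, x in enumerate(operands)]


class ResultCollector:
    """Runs at the consumer and gathers Result objects."""

    def __init__(self) -> None:
        self.results: List[Result] = []

    def collect_result(self, result: Result) -> None:
        self.results.append(result)


def run_application(operands: Iterable[int]) -> List[int]:
    """Local stand-in for the broker/producer round trip (an assumption)."""
    pool = TaskGenerator().process(operands)
    collector = ResultCollector()
    for task in pool:                             # a broker would schedule these
        collector.collect_result(task.execute())  # a producer would run them
    ordered = sorted(collector.results, key=lambda r: r.task_id)
    return [r.value for r in ordered]
```

In a real deployment the tasks would be serialized and executed remotely, so results
may arrive in any order; the sketch sorts by task identifier for that reason.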

The template abstracts methods for generating tasks and retrieving results in 
ALiCE, leaving the programmers with only the task of filling in the task 
specifications. The Java classes comprising the ALiCE programming template are: 

a. TaskGenerator. This is run on a task farm manager machine and allows tasks to be 
generated for scheduling by the resource broker. It provides a method process that 
generates tasks for the application. The programmer merely needs to specify the 
circumstances under which tasks are to be generated in the main method. 

b. Task. This is run on a producer machine, and it specifies the parallel execution 
routine at the producer. The programmer has to fill in only the execute method 
with the task execution routine. 

c. Result. This models a result object that is returned from the execution of a task. It
is a generic object and can contain any number of user-specified attributes and
methods, thus permitting the representation of results in the form of any data
structure that is serializable.

d. ResultCollector. This is run on a consumer machine, and handles user data input
for an application and the visualization of results thereafter. It provides a method
collectResult that retrieves a Result object from the resource broker. The
programmer has to specify the visualization components and control in the collect
method.



4 Performance Evaluation

We have developed several distributed applications using ALiCE. These include life
science applications such as biosequence comparison and progressive Multiple
Sequence Alignment [16], satellite image processing [15] and a distributed equation
solver. In this paper, we present the results of the DES (Data Encryption Standard)
key search [18]. DES key search uses a brute force method to identify a selected
encryption key in a given key space. A DES key consists of 56 bits that are randomly
generated for searching, and 8 bits for error detection. In the algorithm, a randomly
selected key, K, is used to encrypt a known string into a ciphertext. To identify K,
every key in the key space is used to encrypt the same known string. If the encrypted
string for a certain key matches the ciphertext, the search terminates and the value
of K is returned. This problem requires immense computational power as it involves
an exhaustive search in a potentially huge key space.
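
The search procedure can be illustrated as follows. This is a minimal sketch under
stated assumptions, not the authors' implementation: `toy_encrypt` is a keyed hash
that merely stands in for DES encryption, the tasks run one after another rather
than on producers, and all function names are hypothetical.

```python
import hashlib
from typing import Optional


def toy_encrypt(key: int, plaintext: bytes) -> bytes:
    """Stand-in for DES encryption (NOT DES): a keyed hash of the plaintext."""
    return hashlib.sha256(key.to_bytes(8, "big") + plaintext).digest()


def search_task(start: int, size: int, plaintext: bytes,
                ciphertext: bytes) -> Optional[int]:
    """One task: try every key in the range [start, start + size)."""
    for key in range(start, start + size):
        if toy_encrypt(key, plaintext) == ciphertext:
            return key
    return None


def key_search(key_bits: int, task_size: int, plaintext: bytes,
               ciphertext: bytes) -> Optional[int]:
    """Split the key space into tasks and run them sequentially."""
    key_space = 1 << key_bits
    for start in range(0, key_space, task_size):
        size = min(task_size, key_space - start)
        found = search_task(start, size, plaintext, ciphertext)
        if found is not None:
            return found
    return None
```

Because each call to `search_task` touches a disjoint key range, the tasks are
independent of one another, which is what makes the problem suitable for a task farm.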

The test environment consists of a homogeneous cluster and a heterogeneous
cluster with all nodes running RedHat Linux. The 64-node homogeneous cluster
(Cluster I) consists of dual-processor Intel Xeon 1.4GHz nodes with 1GB of
memory. The nodes are connected by a Myrinet network. The 24-node heterogeneous
cluster (Cluster II) consists of sixteen Pentium II 400MHz nodes with 256MB of
RAM, and eight Pentium III 866MHz nodes with 256MB of RAM. These nodes are
connected via a 100Mbps Ethernet switch.

Our performance metric is the execution time to search the entire key space. The
sequential execution time grows exponentially with increasing key sizes. The DES
key problem can be partitioned into a varying number of tasks, with the task size
measured by the number of keys, and the execution time of a task can be estimated
using the time from the sequential run. Table 1 shows the task characteristics for
varying task sizes and problem sizes. The table was used to select an appropriate
task size for the experiments to be carried out in the two grid configurations.



                         32-bit Key                   36-bit Key             40-bit Key
task size      no. of   est. time/task (secs)    no. of   est. time/    no. of    est. time/
(keys)         tasks    Cluster I   Cluster II   tasks    task (secs)   tasks     task (secs)
                                                          Cluster I               Cluster I
  5,000,000      859       20.8        32.9      13,744      20.8       219,902     430.3
 10,000,000      429       41.1        65.4       6,872      43.4       109,951     862.4
 30,000,000      143      122.9       196.6       2,291     127.8        36,650    2587.7
 50,000,000       86      201.9       322.0       1,374     211.4        21,990    4314.3
100,000,000       43      395.4       641.2         687     420.1        10,995    8629.5


Table 1: Estimated Task Execution Times for Varying Task Sizes 
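
The task counts in Table 1 can be reproduced by dividing the size of the key space
by the task size. The sketch below assumes rounding to the nearest integer, which
is an observation that matches the tabulated values rather than something the text
states; the function name is hypothetical.

```python
def number_of_tasks(key_bits: int, task_size: int) -> int:
    """Key-space size divided by task size, rounded to the nearest integer."""
    return round(2 ** key_bits / task_size)
```

For example, `number_of_tasks(36, 50_000_000)` gives 1374, the value listed for
36-bit keys at 50 million keys per task. A scheduler that must cover every key
would round up instead, so that the last, partially filled task is not dropped.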

For our experiments, we selected a task size of 50 million keys per task
and a problem size of 36-bit keys for Cluster I and 32-bit keys for Cluster II. Table 2
shows the results for 4 to 32 producer nodes. The execution time for key search
falls significantly as the number of nodes increases, resulting in greater speedup.





ALiCE: A Scalable Runtime Infrastructure for High Performance Grid Computing 






No. of Producers       Cluster I          Cluster II
                       (36-bit Key)       (32-bit Key)
1 (Est. Sequential)    78 hr 23 min       8 hr 43 min
4                      23 hr 36 min       3 hr 43 min
8                      11 hr 6 min        2 hr 7 min
10                     8 hr 34 min        1 hr 42 min
12                     7 hr 21 min        1 hr 26 min
16                     5 hr 11 min        1 hr 7 min
32                     2 hr 29 min        -


Table 2: Execution Time for Varying Number of Producer Nodes 

We define speedup as T1/Tp, where T1 is the execution time of the sequential
program and Tp is the execution time of the derived parallel program on p processors.
As shown in Figure 3, a speedup of approximately 32 is attained with 32 producers
for 36-bit keys on Cluster I, and approximately 8 with 16 producers for 32-bit keys
on Cluster II. We consider these results highly encouraging, although the
performance of key search needs to be further evaluated with more key space sizes
and nodes. The effects of using other scheduling algorithms in the resource broker
must also be studied, as they may add different overheads to the execution time.
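
As a worked check of these figures, the speedup can be computed directly from the
execution times in Table 2. The sketch below is illustrative only: the times are
transcribed from the table, with the sequential time being the authors' estimate,
and the helper names are hypothetical.

```python
from typing import Dict


def minutes(hours: int, mins: int) -> int:
    """Convert an 'h hr m min' entry to minutes."""
    return 60 * hours + mins


# Execution times from Table 2, keyed by number of producers (1 = est. sequential).
CLUSTER_I: Dict[int, int] = {
    1: minutes(78, 23), 4: minutes(23, 36), 8: minutes(11, 6),
    10: minutes(8, 34), 12: minutes(7, 21), 16: minutes(5, 11),
    32: minutes(2, 29),
}
CLUSTER_II: Dict[int, int] = {
    1: minutes(8, 43), 4: minutes(3, 43), 8: minutes(2, 7),
    10: minutes(1, 42), 12: minutes(1, 26), 16: minutes(1, 7),
}


def speedups(times: Dict[int, int]) -> Dict[int, float]:
    """Speedup T1/Tp for every parallel entry in the table."""
    return {p: times[1] / t for p, t in times.items() if p != 1}
```

With these inputs, `speedups(CLUSTER_I)[32]` is about 31.6 and
`speedups(CLUSTER_II)[16]` is about 7.8, consistent with the rounded values of 32
and 8 quoted above.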




[Figure 3 plots speedup against the number of producers.]

Figure 3: Speedup vs Varying Number of Producers



5 Conclusions and Further Works 

We discussed the design and implementation of the Java-based ALiCE grid system.
The runtime system comprises consumers, producers and a resource broker. Parallel
grid applications are written using a programming template that supports the
distributed shared-memory programming model. We presented the performance of
ALiCE using the DES key search problem. The results show that the homogeneous
cluster yields greater speedup than the heterogeneous cluster for the same task size.
A homogeneous cluster generally has a better load balance than a heterogeneous
cluster, which is made up of different platforms and capabilities.

Much work still needs to be done to transform ALiCE into a comprehensive grid
computing infrastructure. We are in the process of integrating new resource
scheduling techniques and load-balancing mechanisms into the ALiCE core layer to
reduce the overhead in running applications [17]. Task migration, pre-emption and
check-pointing mechanisms are being incorporated to improve the reliability and
fault tolerance of the system.



Abstract. When the mediator architecture is adopted to integrate distributed,
autonomous, relational model based database sources, mappings from the
source schema to the global schema may become inconsistent when the relational
source schema or the global schema evolves. Without mapping adaptation,
users may access no data or wrong data. In this paper, we propose a novel
approach, the global attribute as view with constraints (GAAVC), to publish
mappings that adapt to schema evolution. Published mappings also satisfy
both source schema constraints and global schema constraints, which enables
users to get valid data. We also put forward the GAAVC based mapping
publishing algorithm and mapping adaptation algorithms. A functional
comparison with other approaches shows that our approach outperforms them.
Finally, the mapping adaptation tool GMPMA is introduced, which has been
implemented in the middleware of a railway information grid system.



1 Introduction 

A serious issue in integrating distributed, autonomous, relational model based
databases is the evolution of schemas. In the mediator architecture, mappings from
source schemas to the global schema are used to create source views that query data
from sources. When schemas change, users will get invalid data unless the mappings
are modified. However, revising mappings manually is time-consuming and laborious.
In this paper we propose a GAAVC based mapping publishing algorithm and
corresponding mapping adaptation algorithms to adjust mappings automatically.

Our main contributions are as follows. (1) We put forward the nested schema
decomposition model, according to which we propose the global attribute as view
approach to construct mappings. (2) We develop the GAAVC based mapping publishing
approach and the corresponding algorithm. The approach adapts to the evolution of
the source schema and the global schema. Source views created by the mappings also
guarantee that data from sources is valid. (3) We develop algorithms to
adjust invalid mappings automatically as schemas evolve. (4) We implement the
algorithms in the GMPMA tool.

Section 2 introduces related work. Section 3 defines valid mappings. Section 4
gives the GAAVC approach and the mapping publishing algorithm. Section 5
introduces mapping adaptation algorithms. Section 6 compares the GAAVC approach
with other approaches. Section 7 describes the architecture of the GMPMA tool
before Section 8 concludes.

H. Jin et al. (Eds.): NPC 2004, LNCS 3222, pp. 110-117, 2004. 

© IFIP International Federation for Information Processing 2004 




Mapping Publishing and Mapping Adaptation 






2 Related Works 

We say that an approach adapts to schema evolution if local attributes that have no
invalid mappings can still be accessed as schemas evolve.

The local as view (LAV) method defines the local schema as a view over the
global schema. When schemas evolve, none of the local attributes in the same
schema can be accessed, so LAV is not adaptive to schema changes. Because
source schemas match the global schema, source schema constraints are satisfied
when global constraints are satisfied by views. The global as view (GAV) approach
defines the global schema as a view over the source schema. When schemas change,
none of the local attributes in the same view can be accessed, so GAV is not
adaptive to schema changes either. Because the global schema matches source schemas,
global schema constraints are satisfied when source constraints are satisfied by views.
The global-local as view (GLAV) approach is a variation of LAV and has the same
problem as LAV. The both as view (BAV) approach defines the global attribute as a
view of the local schema and the local attribute as a view of the global schema.
The approach adapts to schema evolution, but the views are defined manually. It
considers the constraints of both the global schema and source schemas. The
correspondence view (CV) approach sets up mappings of views; other attributes of
the same view cannot be accessed when schemas change, so the approach is not
adaptive to schema evolution. Moreover, it considers foreign key constraints but
not key constraints of the global schema.



3 Valid Mappings 

We study relational-based XML schemas and relational schemas with key constraints
and foreign key constraints in this paper. We suppose that there is at most one view
for a set of elements in an XML schema or tables in a relational schema.

Definition 3.1 An attribute is of the form <ID, label, type>.

We use Greek letters, $\alpha$, $\beta$, ..., to represent sets of attributes.

Definition 3.2 A schema is of the form (<ID, label>, {attribute}, {schema}).

Definition 3.3 A mapping of global attributes is an expression of source views:

$$\mathrm{Mapping}(\alpha_g):\ \alpha_g = G(V(\alpha_{s_1}), V(\alpha_{s_2}), \ldots, V(\alpha_{s_n})) = F(\alpha_{s_1}, \alpha_{s_2}, \ldots, \alpha_{s_n})$$

Here $\alpha_g$ is a set of global attributes. V is a function of attributes of one
source, and $V(\alpha_{s_i})$ is the view of the source $s_i$ (i = 1, 2, ..., n). G is a
function of views of different sources. F is the composite function of G and V, so F
is a function of attributes of different sources. The expression is made up of union,
intersection and set difference operators.

Definition 3.4 The mapping of the global schema R is the set of the mappings of all
the attributes of the schema:
$\mathrm{Mapping}_R = \{\alpha_g = F(\alpha_{s_1}, \ldots, \alpha_{s_n}) \mid \alpha_g \in R\}$.

The expression $\alpha \to_{KEY} \beta$ denotes that $\alpha$ is the key of the schema
of $\beta$, and $\alpha \to_{FK} \beta$ denotes that $\beta$ is the foreign key of
$\alpha$. $\mathrm{Mapping}_{S_i}(\alpha)$ denotes the mapping of $\alpha$ in source $S_i$.

Lemma 3.1 (Null value constraints for foreign key, briefly
$\alpha \to_{FK\,null} \beta$) Suppose $\alpha \to_{FK} \beta$; if $\alpha$ = null
then $\beta$ = null.

Definition 3.5 Valid data are data that satisfy key constraints and null value
constraints for foreign key of the global schema.

Definition 3.6 A valid source view is a source view that satisfies key constraints and
null value constraints for foreign key of the global schema.

Definition 3.7 Valid mappings of a global schema are mappings that produce valid
source views.

With valid mappings of a global schema, users can get valid data from databases.
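
To make Definition 3.5 concrete, the following sketch checks a small set of rows
against a key constraint and a null value constraint for a foreign key. It is an
illustration under assumptions rather than part of the GAAVC algorithms: rows are
plain dictionaries, a key is taken to mean non-null and unique values, and the
function names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, Optional[object]]


def satisfies_key(rows: List[Row], key: Sequence[str]) -> bool:
    """Key constraint: key values are non-null and unique across rows."""
    seen = set()
    for row in rows:
        value = tuple(row.get(a) for a in key)
        if None in value or value in seen:
            return False
        seen.add(value)
    return True


def satisfies_fk_null(rows: List[Row],
                      fks: Sequence[Tuple[str, str]]) -> bool:
    """Null value constraint: for each pair (a, b), a null implies b null."""
    return all(row.get(b) is None
               for row in rows for a, b in fks if row.get(a) is None)


def is_valid(rows: List[Row], key: Sequence[str],
             fks: Sequence[Tuple[str, str]]) -> bool:
    """Valid data satisfy both kinds of constraint."""
    return satisfies_key(rows, key) and satisfies_fk_null(rows, fks)
```

A row whose attribute `a` is null while the dependent attribute `b` still holds a
value violates the second check, which is the situation the mapping adaptation
algorithms are meant to prevent.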



4 The Global Attribute as View with Constraints (GAAVC) Based 
Mapping Publishing Approach 

4.1 The Nested Schema Decomposition Model 

$$R(k, A_1, R'(k', A'_1, \ldots), \ldots, A_m) = R_1(k) \cup R_2(k, A_1) \cup R_3(k, k') \cup R_4(k, k', A'_1) \cup \cdots \cup R_n(k, A_m)$$

R is a nested relational schema or XML schema. R' is the sub-schema of R. k is the
key of R and k' is the key of R', where $k = (k_1, k_2, \ldots, k_n)$ and
$k' = (k'_1, k'_2, \ldots, k'_n)$. The atom schemas $R_1, \ldots, R_n$ inherit the
foreign key constraints of R.

Lemma 4.1 In the nested schema decomposition model, the decomposition is a lossless
join decomposition.

Lemma 4.1 indicates that no information is lost by the decomposition.
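
The decomposition can be sketched as a short recursive function. This is an
illustrative reading of the model, not the authors' code: the `Schema` class and
`decompose` are hypothetical names, and each atom schema is represented simply as a
tuple of attribute names (the accumulated keys, optionally followed by one non-key
attribute).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Schema:
    """A nested schema: key attributes, non-key attributes, sub-schemas."""
    key: List[str]
    attributes: List[str] = field(default_factory=list)
    children: List["Schema"] = field(default_factory=list)


def decompose(schema: Schema,
              parent_keys: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Return the atom schemas of a nested schema."""
    keys = parent_keys + tuple(schema.key)
    atoms = [keys]                                       # R(k) or R(k, k')
    atoms += [keys + (a,) for a in schema.attributes]    # R(k, A_i)
    for child in schema.children:
        atoms += decompose(child, keys)                  # sub-schema atoms
    return atoms
```

Every atom schema carries the full key path from the root, which is why joining the
atoms on their keys recovers the original nested schema.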



4.2 The Global Attribute as View with Constraints (GAAVC) Approach 

Definition 4.1 A view key set is the union of the keys of all the schemas of the view:
$key(R_1, R_2, \ldots, R_n) = key(R_1) \cup key(R_2) \cup \cdots \cup key(R_n)$

Lemma 4.2 The view key set is the key of the view.

Definition 4.2 Key constraints on mapping are the global schema key constraints on
mapping: if $\beta_g \to_{KEY} \alpha_g$, $\beta_g = F(\beta_s)$, $\alpha_g = F(\alpha_s)$
and $\alpha_s = (\alpha_{s_1}, \alpha_{s_2}, \ldots, \alpha_{s_m})$, then for any
$\alpha_{s_i}$ (i = 1, 2, ..., m) there is a view key set $\beta_{s_j}$
($1 \le j \le n$) with $S_i = S_j$.

Lemma 4.3 (key constraints on null mapping) If $\beta_g \to_{KEY} \alpha_g$,
$\mathrm{Mapping}_s(\beta_g)$ = null, and the mappings of $\beta_g$ and $\alpha_g$
satisfy key constraints on mapping, then $\mathrm{Mapping}_s(\alpha_g)$ = null.

Lemma 4.4 (foreign key constraints on null mapping, $\beta_g \to_{FK\,null} \alpha_g$)
If $\beta_g \to_{FK} \alpha_g$ and Mapping($\beta_g$) = null, then
Mapping($\alpha_g$) = null.

Definition 4.3 (the global attribute as view approach, GAAV) The GAAV approach
uses source views to express each atom schema of the global schema.

Definition 4.4 (the global attribute as view with constraints (GAAVC) approach)
Given the nested schema $R(k, A_1, R'(k', A'_1, \ldots), \ldots, A_m)$ and the
mapping expression of R,

$$R_{mapping} = \{\, k = F_1(\alpha_{s_1}, \alpha_{s_2}, \ldots, \alpha_{s_n}),\ A_1 = F_2(A_{s_1}, A_{s_2}, \ldots, A_{s_n}),\ k' = F_3(\alpha'_{s_1}, \alpha'_{s_2}, \ldots, \alpha'_{s_n}),\ A'_1 = F_4(A'_{s_1}, A'_{s_2}, \ldots, A'_{s_n}),\ \ldots,\ A_m = F_l(A^m_{s_1}, A^m_{s_2}, \ldots, A^m_{s_n}) \,\},$$

mappings are constructed according to the following rules:

1) According to the nested schema decomposition model, the approach constructs
mappings between atom schemas and source schemas, which are equivalent to mappings
between the original global schema and source schemas.

2) Mappings of atom schemas satisfy key constraints on mapping.

3) Mappings of atom schemas satisfy foreign key constraints on null mapping.

4) The mapping expression of an atom schema is the expression of the attribute in
the deepest schema, while the operands are all the attributes of the atom schema.

Theorem 4.1 Mappings created by the GAAVC approach are valid mappings.



4.3 Mapping Expressions 

Mappings of atom schemas of the global schema are expressed as follows:

The key mapping expression: <addKey, <global key>, expression(<source
attributes>, <source view constraints>)>

The non-primary attribute mapping expression: <addAttribute, <global key, global
non-primary attribute>, expression(<source attributes>, <source view constraints>)>

The GAAVC approach adapts to schema evolution because attributes that have no
invalid mapping can still be accessed when schemas change.

Lemma 4.5 Given $k \to_{KEY} A$, the source view constraints of R(k) and R(k, A) are
the same.



4.4 The GAAVC Based Mapping Publishing Algorithm

Given a global schema, source schemas and a set of mappings between the global
schema and source schemas, we set up mappings according to the GAAVC approach.

Step 1: verify whether the mappings satisfy key constraints on mapping. If not, exit.

Step 2: get M, the minimal cover of the global functional dependencies.

Step 3: get ordered partitions of M so that $\beta$ and $\alpha$ are in the same
partition and $\beta$ is before $\alpha$ if $\beta \to_{KEY} \alpha$ or
$\beta \to_{FK} \alpha$.

Step 4: build mapping expressions of the attributes of each ordered partition in
sequence. The mapping expression of $\alpha$ is built only if, for every attribute
$\beta$ with $\beta \to_{KEY} \alpha$ or $\beta \to_{FK} \alpha$,
Mapping($\beta$) $\neq$ null.
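
Steps 3 and 4 amount to processing attributes in dependency order. The sketch below
illustrates that idea with the standard-library `graphlib` module (Python 3.9 or
later). It is a simplified reading, not the published algorithm: the partitioning
and the minimal cover of Step 2 are omitted, mapping expressions are plain strings,
and `publish`, `depends_on` and `candidate` are hypothetical names.

```python
from graphlib import TopologicalSorter
from typing import Dict, Optional, Set


def publish(depends_on: Dict[str, Set[str]],
            candidate: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Build mappings so that prerequisites are handled first.

    depends_on[a] holds the attributes b with b ->KEY a or b ->FK a.
    candidate[a] is the expression proposed for a, or None if there is none.
    """
    published: Dict[str, Optional[str]] = {}
    # static_order() yields every attribute after all of its prerequisites.
    for attr in TopologicalSorter(depends_on).static_order():
        prerequisites = depends_on.get(attr, ())
        ready = all(published.get(b) is not None for b in prerequisites)
        published[attr] = candidate.get(attr) if ready else None
    return published
```

An attribute whose key or foreign key prerequisite has a null mapping is itself
published as null, which mirrors the condition in Step 4. A cyclic dependency would
make `TopologicalSorter` raise `CycleError`; the sketch does not handle that case.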

Step 2 gets rid of redundant functional dependencies of the global schema, and Step 3
makes partitions to reduce the amount of searching.



5 Mapping Adaptations

When a schema changes, we should adjust the mappings to keep them valid. Adding
attributes has no influence on mappings. When a schema or an attribute is renamed, we
only need to go through all the mapping expressions and change the name.



5.1 Deleting an Attribute or Deleting the Mapping of an Attribute

Steps 2, 3, 7 and 8 in the following algorithm ensure key constraints on mapping.
Steps 4, 5, 11 and 12 ensure foreign key constraints on null mapping.

deleteAttribute($\alpha$) {
    if $\alpha$ is a global attribute, that is $\alpha_g$                        1
        for each mapping expression E that contains $\alpha_g$                  2
            delete E                                                             3
        for each attribute set A with Mapping(A) $\neq$ null and
                $K \to_{FK} A$, where K is the key of E                           4
            deleteAttribute(A)                                                   5
    if $\alpha$ is an attribute of source S, that is $\alpha_s$                  6
        for each mapping expression E that contains $\alpha_s$                  7
            delete from E the mapping of source S                                8
            if the global attributes $\beta$ of E are a key                      9
                if no mapping expression contains $\beta$                        10
                    for each attribute set A with Mapping(A) $\neq$ null
                            and $\beta \to_{FK} A$                               11
                        deleteAttribute(A) }                                     12
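
The cascading behaviour of steps 1 to 5 can be sketched as follows. This is a
deliberately simplified illustration that covers only the global attribute case:
each attribute has at most one mapping expression stored as a string, and
`fk_dependents` (a hypothetical name) lists, for each attribute, the attributes
that depend on it through a foreign key.

```python
from typing import Dict, Optional, Set


def delete_attribute(attr: str,
                     mapping: Dict[str, Optional[str]],
                     fk_dependents: Dict[str, Set[str]]) -> None:
    """Null the mapping of attr, then cascade along foreign key dependencies."""
    if mapping.get(attr) is None:
        return                       # already null; this also stops cycles
    mapping[attr] = None             # delete the mapping expression
    for dependent in fk_dependents.get(attr, ()):
        delete_attribute(dependent, mapping, fk_dependents)
```

After the call, every attribute reachable from the deleted one through foreign key
dependencies has a null mapping, so the foreign key constraints on null mapping
(Lemma 4.4) continue to hold for the remaining mappings.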

5.2 Deleting a Foreign Key Constraint 

Steps 2 and 3 in the following algorithm ensure key constraints on mapping, while
steps 5 and 6 ensure foreign key constraints on null mapping.

deleteForeignKey($\alpha \to_{FK} \beta$) {
    if $\alpha$ and $\beta$ are attributes of source S                           1
        for each mapping expression E that contains $\alpha$                    2
            delete the mapping of E in source S                                  3
            if the global attributes P in E are the primary
                    key and no mapping expression contains P                     4
                for each global attribute set A with
                        Mapping(A) $\neq$ null and $P \to_{FK} A$                5
                    deleteAttribute(A) }                                         6

5.3 Adding a Foreign Key Constraint 

When a source foreign key constraint is added, the existing mappings are still valid.
When a global foreign key constraint is added, steps 3 and 4 in the following
algorithm ensure foreign key constraints on null mapping.

addForeignKey($\alpha \to_{FK} \beta$) {                                         1
    if $\alpha$ and $\beta$ are global attributes                                2
        if Mapping($\beta$) $\neq$ null and Mapping($\alpha$) = null             3
            deleteAttribute($\beta$) }                                           4

5.4 Adding a Source Mapping 

The foreign key constraints on null mapping are ensured by steps 1, 2 and 3.

addSourceMapping($\mathrm{Mapping}_s(\alpha_g) = F(\alpha_s)$) {
    for each global attribute A with $A \to_{FK} \alpha_g$,                      1
        if Mapping(A) = null                                                     2
            exit                                                                 3
    if the mapping of $\alpha_g$ satisfies key constraints
            on mapping when adding $\mathrm{Mapping}_s(\alpha_g) = F(\alpha_s)$   4
        add the mapping $\mathrm{Mapping}_s(\alpha_g) = F(\alpha_s)$ }            5

5.5 Deleting a Source Mapping 

Step 2 ensures key constraints on mapping. Steps 4, 5 and 6 ensure foreign key
constraints on null mapping.

deleteSourceMapping($\mathrm{Mapping}_s(\alpha_g) = F(\alpha_s)$) {
    for each mapping expression E that contains $\alpha_s$                      1
        delete the mapping of E in source S                                      2
    if $\alpha_g$ is a set of attributes of key K,                               3
        if Mapping(K) = null                                                     4
            for each attribute set A with Mapping(A) $\neq$ null
                    and $K \to_{FK} A$                                            5
                deleteAttribute(A) }                                             6



6 Comparisons 



Table 1. Mapping publishing approaches comparisons

Comparisons \ Publishing methods             LAV   GAV   GLAV   BAV   CV                             GAAVC
Adaptive to schema evolution                 No    No    No     Yes   No                             Yes
Considers key and foreign key constraints    Yes   Yes   Yes    Yes   Only foreign key constraints   Yes
Adjusts mappings manually                    No    No    No     Yes   No                             No



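The detailed discussion is given in Section 2. Table 1 shows that GAAVC is the only
approach that adapts to schema evolution and considers both key and foreign key
constraints without requiring mappings to be adjusted manually.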



7 Architecture of the GAAVC Based Mapping Publishing and 
Mapping Adaptation (GMPMA) Tool 

We implement the GAAVC based mapping publishing algorithm and mapping adaptation
algorithms in the GMPMA tool. Figure 1 shows the architecture of the
GMPMA tool. Schema evolutions are detected through the graphical user interface or
the monitor. The mapping publishing and mapping adaptation engine then constructs
or adjusts mappings. The architecture has been implemented in the middleware
of the railway information grid system.




Fig. 1. The GMPMA Architecture



8 Conclusions

To ensure that users get valid data when schemas evolve, we propose the GAAVC
based mapping publishing algorithm and mapping adaptation algorithms and implement
them in the GMPMA tool in the middleware of a railway information grid system.
Our approach is distinctive in three ways: 1) The GAAVC approach constructs
mappings that adapt to schema evolution. 2) Our mappings enable users to get valid
data. 3) We consider mappings from one computation expression over some source
attributes to one global attribute. Adapting to schema evolution in
object-oriented databases is our future work.



Abstract. This paper discusses the job shop scheduling problem with the objective
of minimizing makespan. A new heuristic algorithm that embeds an improved
shifting bottleneck procedure into the Tabu Search (TS) technique is presented.
This algorithm differs from previous procedures because the improved
shifting bottleneck procedure is a new procedure for the problem, and the two
key TS strategies of intensification and diversification are modified.
In addition, a new kind of neighborhood structure is defined, and the method for
local search differs from previous ones.

This algorithm has been tested on many common problem benchmarks of
various sizes and levels of hardness and compared with several other algorithms.
Computational experiments show that this algorithm is one of the most
effective and efficient algorithms for the problem. In particular, it obtains a lower
upper bound for an instance with 50 jobs and 20 machines within a short period.



1. Introduction 

The job shop scheduling problem with which we are concerned consists in scheduling
a set of jobs on a set of machines with the objective of minimizing the makespan,
i.e., the maximum completion time needed to finish all the jobs. Any schedule is
subject to the constraints that each job has a fixed processing order through the
machines and each machine can process at most one job at a time.

The job shop scheduling problem is NP-hard in the strong sense and is even one of
the hardest combinatorial optimization problems [5]. It is well known that only small
instances can be solved within a reasonable computational time by exact algorithms.
For large instances, however, some encouraging results have recently been
obtained with heuristic algorithms that are based on local search [1,8,14].
Generally, starting from an initial feasible solution, a local search method iteratively
selects a proper solution from the neighborhood. As observed by Van Laarhoven
et al. [17] and Nowicki et al. [18], both the choice of a good initial solution and
the neighborhood structure are important to an algorithm's performance.

This paper builds on our recent work and presents a heuristic algorithm
based on the Tabu Search (TS) technique and on the improved shifting
bottleneck procedure (ISB) [9]. Here, ISB is used to find a good initial
solution, and the local re-optimization procedure of ISB is used to direct the local
search of TS from one region of the solution space to a different one. In the local
search procedure of TS, we define a new kind of neighborhood structure that differs
from previous ones. These two points ensure the efficiency and effectiveness
of our algorithm.

H. Jin et al. (Eds.): NPC 2004, LNCS 3222, pp. 118-128, 2004.

In this paper, the job shop scheduling problem is formalized as a mathematical
model and is represented on a disjunctive graph. Then, the TS technique
with its two strategies of intensification and diversification is analyzed, and the
new heuristic algorithm, denoted TSISB, is described. Finally, computational results
on several test problem instances are shown, and the algorithm is compared with some
typical algorithms for the problem.



2. The Problem Definition

Let $J = \{1, 2, \dots, n\}$ be a set of jobs, $M = \{1, 2, \dots, m\}$ be a set of
machines and $V = \{0, 1, 2, \dots, N, \#\}$ be a set of operations. Each job consists of
a sequence of operations, each of which has to be processed on a given machine for a
given time. Here, 0 and # represent the dummy start and finish operations, respectively.
A schedule is an allocation of each operation to the time (start time) from which it is
processed. In other words, it is an allocation of the processing order of the operations
on the machines. The problem is to find a schedule that minimizes the make-span subject
to the constraints: (i) the precedence of operations in each job must be respected;
(ii) once a machine starts to process an operation it cannot be interrupted, and each
machine can process at most one operation at a time. Let $A$ denote the set of pairs of
adjacent operations constrained by the precedence relations as in (i); $V_k$ denote the
set of operations that are processed by machine $k$ ($k \in M$); $E_k$ be the set of
pairs of operations in $V_k$, which therefore have to be sequenced as specified in (ii);
and $d_i$ and $t_i$ be the process-time (fixed) and the start time (variable) of
operation $i$ ($i \in V$), respectively. The process-times of both 0 and # are zero,
i.e. $d_0 = d_\# = 0$. Now, the problem can be stated as the following mathematical
model:

$$
\begin{aligned}
\min\ & t_\# \\
& t_i \ge 0, && i \in V \\
& t_j - t_i \ge d_i, && (i,j) \in A \qquad\qquad (1) \\
& t_j - t_i \ge d_i \ \lor\ t_i - t_j \ge d_j, && (i,j) \in E_k,\ k \in M
\end{aligned}
$$

The first set of constraints means that $t = 0$ is the start time of the system, and
the next two represent the constraints (i) and (ii), respectively, where “$\lor$” means
“or”. Any solution of (1) is called a schedule, i.e. a feasible solution of the problem.
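
The three constraint sets of model (1) can be checked mechanically for a given vector of start times. The following is a minimal sketch, not part of the original algorithm; the function name `is_feasible` and the small two-job instance are assumptions made purely for illustration.

```python
def is_feasible(t, d, conj, disj):
    """Check start times t against the three constraint sets of model (1)."""
    # t_i >= 0 for every operation i in V
    if any(start < 0 for start in t.values()):
        return False
    # t_j - t_i >= d_i for every conjunctive pair (i, j) in A
    if any(t[j] - t[i] < d[i] for i, j in conj):
        return False
    # t_j - t_i >= d_i  or  t_i - t_j >= d_j  for every disjunctive pair in E_k
    return all(t[j] - t[i] >= d[i] or t[i] - t[j] >= d[j] for i, j in disj)


# Hypothetical instance: job 1 = (1, 2), job 2 = (3, 4);
# operations 1 and 4 share one machine, operations 2 and 3 share the other.
d = {0: 0, 1: 3, 2: 2, 3: 2, 4: 4, "#": 0}
conj = [(0, 1), (1, 2), (2, "#"), (0, 3), (3, 4), (4, "#")]
disj = [(1, 4), (2, 3)]
t = {0: 0, 1: 0, 2: 3, 3: 0, 4: 3, "#": 7}
print(is_feasible(t, d, conj, disj))  # True
```

In this toy schedule the make-span is the start time of the dummy finish operation, $t_\# = 7$.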

It is useful to represent this problem on a disjunctive graph $G = (V, A, E)$ [4],
where $V$ is the set of nodes, $A$ is the set of ordinary (conjunctive) arcs and $E$ is
the set of disjunctive arcs. The nodes, the directed arcs and the disjunctive pair-arcs
of $G$ correspond to the operations, the precedence relations between two adjacent
operations of a job, and the pairs of operations that are processed by the same machine,
respectively. So $E = \bigcup_{k \in M} E_k$, where $E_k$ is the subset of disjunctive
pair-arcs corresponding to the pairs of operations that are processed by machine $k$.
The weight (length) of each arc $(i,j)$ is $d_i$, the process-time of operation $i$,
where $i \in V$, $(i,j) \in A \cup E$ and operation $i$ is processed right before
operation $j$.

Fig. 1 is the disjunctive graph for an instance with $n = 3$, $m = 3$ and $N = 8$. The
number on each conjunctive arc is the weight (length) of the arc; the weights of the
disjunctive arcs are omitted.

A subset $S_k$ of $E_k$ ($k \in M$) is called a selection if it contains just one arc of
each disjunctive pair-arc of $E_k$, and $S_k$ is acyclic if it does not contain any
cycle. According to Adams [2], a feasible processing order of the operations on
machine $k$ is equivalent to exactly one acyclic selection $S_k$, and to determine a
processing order of the operations on a machine is to sequence this machine. So, to
sequence machine $k$ is to find an acyclic selection $S_k$ of $E_k$.

Let $M_0$ be the set of machines that have been sequenced; a partial selection $S$ is
then the union of selections $S_k$, one for each $E_k$ ($k \in M_0$). Each $S$ gives
rise to a directed graph $D_S = (V, A \cup S)$. If $D_S$ is acyclic, then $S$ is acyclic;
however, the converse is not true [2]. A complete selection $S$ (i.e. $M_0 = M$) that
generates an acyclic directed graph $D_S$ defines a schedule, and it is a feasible
solution of the job shop scheduling problem. To solve the problem is to find a complete
selection $S^*$ that gives rise to an acyclic $D_{S^*}$ and minimizes the length of the
longest (critical) path in $D_{S^*}$.
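
To make the link between a complete selection and its make-span concrete, the sketch below computes the longest path from the dummy start to the dummy finish operation in the digraph defined by the conjunctive arcs plus a selection. It is only an illustration under stated assumptions: the function name `makespan` and the two-job instance are hypothetical, and the standard-library `graphlib` module (Python 3.9+) is used for the topological order.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+


def makespan(d, conj, selection, sink="#"):
    """Longest path length to `sink` in D_S = (V, A u S); arc (i, j) weighs d[i]."""
    preds = {v: set() for v in d}
    for i, j in list(conj) + list(selection):
        preds[j].add(i)
    start = {}
    # static_order() raises CycleError when the digraph contains a cycle
    for v in TopologicalSorter(preds).static_order():
        start[v] = max((start[u] + d[u] for u in preds[v]), default=0)
    return start[sink]


# Hypothetical instance: job 1 = (1, 2), job 2 = (3, 4);
# operations 1 and 4 share one machine, operations 2 and 3 share the other.
d = {0: 0, 1: 3, 2: 2, 3: 2, 4: 4, "#": 0}
conj = [(0, 1), (1, 2), (2, "#"), (0, 3), (3, 4), (4, "#")]

print(makespan(d, conj, [(1, 4), (3, 2)]))  # 7

try:
    makespan(d, conj, [(4, 1), (2, 3)])     # closes the cycle 1 -> 2 -> 3 -> 4 -> 1
except CycleError:
    print("this selection does not define a schedule")
```

The second call shows why acyclicity of each single selection is not enough: each machine's arc is acyclic on its own, yet together with the conjunctive arcs they close a cycle.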

Let $G_k = (V_k, E_k)$. Any acyclic selection $S_k$ in $E_k$ corresponds to exactly one
Hamilton path (denoted $H_k$) of $G_k$, and the converse is also true [2]. In this paper,
the acyclic $S_k$, $S = \bigcup_{k \in M_0} S_k$ and $D_S = (V, A \cup S)$ are replaced
by $H_k$, $H = \bigcup_{k \in M_0} H_k$ and $D_H = (V, A \cup H)$, respectively.

For a feasible solution (a schedule) $D_H$, the swap of two adjacent operations
processed by the same machine on the critical path may improve the solution [17], which
is usually the basis for defining the neighborhood for local search. For this reason,
the critical path is decomposed into a series of critical blocks
$(B_1, B_2, \dots, B_r)$ [13]. Each block contains operations processed on one machine,
and any two operations in the blocks $B_i$ and $B_{i+1}$ ($1 \le i < r$), respectively,
are processed by different machines.





Fig. 1. Disjunctive graph of an instance

Fig. 2 shows a feasible solution via a digraph $D_H$ with
$H = ((5,1), (1,6)) \cup ((2,7)) \cup ((4,8), (8,3))$. A critical path $P(0, \#)$ in this
$D_H$ is $(0, 4, 5, 1, 6, 7, 8, 3, \#)$ with $r = 4$, $B_1 = (4)$, $B_2 = (5, 1, 6)$,
$B_3 = (7)$ and $B_4 = (8, 3)$, and a length of 47.
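
The block decomposition of this example can be reproduced with a few lines of code. The sketch below is illustrative only: the function name `critical_blocks` is an assumption, and the machine labels "a", "b", "c" are placeholders, since the text gives only which operations share a machine (through $H$), not the machine indices.

```python
from itertools import groupby


def critical_blocks(path, machine_of):
    """Split a critical path into maximal runs of operations on the same machine."""
    return [tuple(run) for _, run in groupby(path, key=lambda op: machine_of[op])]


# Grouping taken from H: (5, 1, 6), (2, 7) and (4, 8, 3) each share one machine.
machine_of = {5: "a", 1: "a", 6: "a", 2: "b", 7: "b", 4: "c", 8: "c", 3: "c"}
path = [4, 5, 1, 6, 7, 8, 3]  # critical path without the dummy operations 0 and #

print(critical_blocks(path, machine_of))
# [(4,), (5, 1, 6), (7,), (8, 3)]
```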



3. The Algorithm 

Both the TS technique and the ISB procedure are the cornerstones of the new algorithm
TSISB. In this section these two techniques are described in detail.

3.1 The Tabu Search 

The Tabu Search technique proposed and formalized by Glover [10,11] is a meta-heuristic
algorithm that is used to get optimal or near-optimal solutions of combinatorial
optimization problems. This method is based on an iterative procedure of neighborhood
search to find a member $\theta^*$ in a certain finite set $\Omega$ of feasible
solutions, where $\theta^*$ minimizes some objective function $C(\cdot)$.

Neighborhood search methods are iterative procedures in which a neighborhood
$N(\theta)$ must be predefined for each solution $\theta \in \Omega$, and each neighbor
of $\theta$ is defined by some modification of $\theta$ itself. The next solution
$\theta'$ to $\theta$ (one of the neighbors of $\theta$) is searched among $N(\theta)$,
and a step from $\theta$ to $\theta'$ is usually called a move. Starting from a current
feasible solution $\theta^c$, all the neighbors in $N(\theta^c)$ are examined and the
solution $\theta'$ with usually the best value of the objective function is chosen as
the next solution, i.e. $C(\theta') \le C(\theta'')$ for all $\theta'' \in N(\theta^c)$.
This is just the greedy scheme, which easily gets stuck in local optima. So, the strategy
that the movement from $\theta^c$ to $\theta' \in N(\theta^c)$ is allowed even if
$C(\theta') > C(\theta^c)$ helps the search escape from the trap of local optima.
This strategy is one of the important characteristics of the TS technique.

With the TS scheme, cycling, i.e. the search returning to a solution that has already
been visited, may occur. To prevent this cycling, a structure called the Tabu list $L$
with length $l$ is introduced in order to prevent the search from returning to a solution
visited within the last $l$ iterations. In general, the TS process stops when
$C(\theta)$ is close enough to the lower bound of $C(\cdot)$, when no improvement occurs
over the best solution within a given number of iterations, or when the time limit runs
out.

3.2 Neighborhood Structure 

It is known that there is no fork on the search track of TS. Therefore, there
must be more information to direct the exploration. In fact, there is short-term and
long-term information concerned with the exploration process. The systematic use of
this information is the essential feature of TS. The approach uses this strategy not only
to avoid cycling but also to explore new directions in the neighborhood. The short-term
information, represented by the Tabu list, is based on the last $l$ iterations and will
partly prevent cycling. The long-term information contains $C^*$, the best value of
$C(\cdot)$ found by TS so far.

The exploration process in $\Omega$ is described in terms of moves. For each solution
$\theta \in \Omega$, let $M(\theta)$ denote the set of moves that can be applied to
$\theta$, and let a next solution of $\theta$ be $\theta \oplus v$; then the neighborhood
of $\theta$ can be denoted as
$N(\theta) = \{w \mid \exists v \in M(\theta),\ w = \theta \oplus v\}$. In general, a
move is reversible, i.e. for each $v$ there exists a move $v'$ such that
$(\theta \oplus v) \oplus v' = \theta$. So, instead of storing complete solutions, the
Tabu list stores only the move itself, or its reverse, associated with the move actually
performed. Unfortunately, the restrictions of the Tabu list are sometimes so strong that
they prevent the search from reaching a very good solution. This shortcoming can be
overcome by using a sort of long-term information, the aspiration criterion, which allows
the algorithm to choose a move from among the forbidden moves, i.e. tabu moves. A tabu
move applied to a solution $\theta$ is promising if it gives a solution better than the
best one found so far.

The neighborhood structure used in our algorithm is described as follows. In
what follows, we denote the job-predecessor and job-successor of an operation $w$
by $p(w)$ and $s(w)$, respectively. As a matter of fact, if $p(w)$ or $s(w)$ exists,
then $(p(w), w)$ or $(w, s(w))$ belongs to $A$. For a feasible solution $D_H$, denote by
$L(w, u)$ the length of the longest path from $w$ to $u$ in $D_H$, and by $P(0, \#)$ a
critical path of $D_H$. Let $(w, h_1, h_2, \dots, h_k)$ be a critical block of
$P(0, \#)$, and $p(w)$ be in $P(0, \#)$. For any $h_i$ ($i = 1, 2, \dots, k$), if there is

$$L(0, w) \ge L(0, p(h_i)) \qquad (2)$$

then a backward move on $w$ and $h_i$, i.e., letting operation $h_i$ be processed right
before operation $w$, will yield a new feasible solution [9]. This new solution is
regarded as one of the neighbors of $D_H$. Also, let $(h_1, h_2, \dots, h_l, u)$ be
another critical block of $P(0, \#)$, and $s(u)$ be in $P(0, \#)$. For any $h_j$
($j = 1, 2, \dots, l$), if there is

L(f},s(hj))>L(f},u) (3) 

then a forward move on $h_j$ and $u$, i.e., letting operation $h_j$ be processed right
after operation $u$, will yield a new feasible solution [9]. This new solution is
regarded as one of the neighbors of $D_H$, too. For all critical blocks of $P(0, \#)$,
we test them with the inequalities (2) and (3) to generate all neighbors of $D_H$. It has
been proved that swapping $w$ and $h_1$, or $h_l$ and $u$, must lead to a feasible
neighbor of $D_H$, so the two inequalities are ignored in our algorithm.

Based on these two kinds of moves, the neighborhood of $D_H$ consists of all such
neighbors. Furthermore, this new kind of neighborhood structure is different from
all those used by previous authors, such as Van Laarhoven et al. [15], Nowicki et al.
[16] and Pezzella et al. [17]. Compared with our neighborhood, those of Van Laarhoven and
Pezzella are larger, which slows down the local search, and that of Nowicki is smaller,
which usually limits the local search to a quite narrow area of the solution space.
Balas makes use of a kind of neighborhood structure similar to ours, which strikes a
balance between the search speed and the search space [5]. However, we greatly reduce
the computational cost of each step.

Now, let $\theta^*$ be the best solution so far and $I$ be the iteration counter; the TS
procedure is described as follows:

Step1. Choose an initial solution $\theta$, set $\theta^* = \theta$, $I = 0$;

Step2. $I = I + 1$, and generate the subset $N^*$ of $N(\theta)$ such that either the
applied move does not belong to the Tabu list or at least one of the aspiration criteria
is satisfied;

Step3. Choose a best solution $\theta \in N^*$ according to the objective function
$C(\cdot)$ of the problem;

Step4. If $C(\theta) < C(\theta^*)$, then set $\theta^* = \theta$. Update the Tabu list
and the aspiration criterion;

Step5. If a stopping criterion is met, then stop. Otherwise, go to Step2.

Here, Steps 2, 3 and 4 constitute the local search procedure of TS. Some of the
stopping rules are as follows: $N(\theta) = \emptyset$; $I$ is larger than the maximum
number of iterations; the number of iterations since the last improvement of the best
solution is larger than a specific number; or the optimal or a near-optimal solution is
found.
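
The five steps above can be sketched generically as follows. This is a simplified illustration, not the authors' implementation: the function `tabu_search`, its callback parameters and the one-dimensional toy problem are all assumptions, the Tabu list stores reversed moves, the only aspiration criterion is "better than the best solution so far", and the search simply stops when $N^*$ is empty or the iteration budget is spent.

```python
from collections import deque


def tabu_search(theta, cost, moves, apply_move, reverse, tabu_len, max_iter):
    best = theta                                   # Step1
    tabu = deque(maxlen=tabu_len)                  # oldest entries drop out automatically
    for _ in range(max_iter):                      # Step2: next iteration
        allowed = []
        for v in moves(theta):
            neighbor = apply_move(theta, v)
            # keep non-tabu moves, plus tabu moves that beat the best (aspiration)
            if v not in tabu or cost(neighbor) < cost(best):
                allowed.append((cost(neighbor), v, neighbor))
        if not allowed:                            # N* is empty: stop
            break
        _, v, theta = min(allowed, key=lambda c: c[0])   # Step3: best allowed neighbor
        if cost(theta) < cost(best):               # Step4: update the best solution
            best = theta
        tabu.append(reverse(v))                    # forbid undoing the move just made
    return best                                    # Step5: stopping criterion met


# Toy problem (not from the paper): walk along a line of costs with moves -1 / +1.
costs = [5, 3, 4, 6, 2, 1, 0, 3, 4, 5]
best = tabu_search(
    theta=0,
    cost=lambda x: costs[x],
    moves=lambda x: [v for v in (-1, 1) if 0 <= x + v < len(costs)],
    apply_move=lambda x, v: x + v,
    reverse=lambda v: -v,
    tabu_len=3,
    max_iter=20,
)
print(best)  # 6
```

A purely greedy descent from position 0 would stop at position 1 (cost 3); accepting the worsening moves to positions 2 and 3 while the reverse move is tabu lets the search reach position 6 (cost 0).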

3.3 Initial Solution and Tabu List 

In most algorithms associated with the TS technique, a good initial solution is
fundamental for the computational performance of the algorithm. ISB is an effective
algorithm for the job shop scheduling problem. Choosing it to generate the initial
feasible solution allows our algorithm to obtain quite good solutions in comparable
computational time, or the same solution in shorter computational time.

The improved shifting bottleneck procedure is based on the famous SB procedure
proposed by Adams [2], and it solves the one-machine problem by DS (the Schrage
algorithm with disturbance). The main steps of ISB are as follows:

Step1. Identify the bottleneck machine and sequence it with DS;

Step2. Re-optimize the machines of $M_0$ with DS in turn (at most 3 times), while
keeping the others fixed. If $M_0 = M$, stop. Otherwise, go to Step1.

The procedure iterates over each machine and finishes when no improvement is
found. At the last step, after the last machine has been sequenced, the procedure
continues the local re-optimization until there is no improvement for a full cycle.

In the acyclic digraph $D_H$, let $r_i = L(0, i)$ and $q_i = L(i, \#) - d_i$; DS is given as follows:

Stepl. Set t = min{r,; left}, Vk, 

Step2. If r, > t, then m, = - S{ r, - 1), otherwise m, ;sR; 

Step3. Choose an operation from $R$, say $j$, with the greatest $u_j$; if there are ties,
break them by giving preference to the greatest $q_j$; if there are still ties, break
them by giving preference to the greatest $d_j$; and if ties remain, break them
by choosing randomly. Set $t_j = \max\{r_j, t\}$, $R \Leftarrow R \setminus \{j\}$;

Step4. If $R = \emptyset$, stop. Otherwise, set
$t = \max\{t_j + d_j, \min\{r_i;\ i \in R\}\}$, go to Step2.

Here $\delta$ is the disturbance coefficient with the value $\lceil 3\sqrt{n}/2 \rceil$,
and $n$ is the number of nodes in $V_k$. It is easy to see that the complexity of DS is
$O(n^2)$.

As a matter of fact, the Tabu list $L$ is one of the components of the neighborhood
structure of TS, and the value of its length $l$ is an important parameter. Nowicki
implements a fixed value of $l = 8$ [16], and Pezzella adopts a variable value of $l$
from $n/2$ to $2n$ [17]. Since any implementation of TS is problem oriented and needs
particular definitions of the values of tuning parameters such as $l$ and the level of
aspiration [17], $l$ takes the semi-variable value $\lfloor (n+m)/2 \rfloor$ in our
experiments. Because it depends on both $n$ and $m$, this is a new way to determine the
value of $l$. In TSISB, the way of updating the Tabu list is not the same as that of
Nowicki's procedure. In particular, when $N^* = \emptyset$ but
$N(\theta^c) \ne \emptyset$, TSISB selects the oldest tabu move while not repeating the
latest tabu move.

3.4 The Intensification and Diversification 

Recently, TS has been improved by aspiration criteria, intensification and
diversification [12], which improve the effectiveness and efficiency of TS. However, our
strategies of intensification and diversification are different from those of other
authors.

The intensification strategy makes the algorithm search around some good solutions.
TSISB implements this strategy not by the back jumping scheme used in the procedure
of Nowicki et al. but by setting a quite large upper bound on the number of iterations.
This upper bound is denoted as Maxiter. TSISB does not change the local search strategy
of TS until the best solution obtained so far has not been improved within Maxiter
iterations. Usually, the larger Maxiter is, the better the quality of the solutions is.
However, a larger Maxiter needs more computing time, and the quality of the solutions
may not improve indefinitely. In a word, the intensification procedure is just the local
search procedure of TS, i.e. Steps 2, 3 and 4 as described in §3.2.

On the other hand, the diversification strategy makes the algorithm search in
different regions of the solution space that are far from each other. According to the
literature, a sufficiently large number of iterations is used to separate these
regions from each other [12]. To direct the search to different regions, TSISB
implements the local re-optimization procedure of ISB after Maxiter steps of the local
search of TS have been carried out. This local re-optimization is the procedure of Step2
in ISB, where $M_0 = M$, and it does not stop until there is no improvement during a full
cycle. In fact, starting from a feasible solution, the re-optimization procedure is also
a local search procedure whose neighborhood consists of the swaps of any two operations
processed by the same machine [5]. The local re-optimization procedure, which has a low
complexity, is very different from the local search procedure of TS, which
makes our diversification efficient and effective. In other words, once this strategy is
implemented, the local search can usually arrive at a new region. Furthermore,
this diversification is different from those in the literature [3,6,7,17]; one important
reason is that the constraint of the Tabu list is ignored while implementing this
diversification procedure. To get good solutions in a moderate period, the number of
times the diversification strategy is implemented is less than Maxt.

Let $T$ denote the number of times that this strategy has been implemented; the main
steps of TSISB are described as follows:

Step1. Get an initial solution $\theta$ by ISB, and set $\theta^* = \theta$, $I = 0$ and
$T = 0$;

Step2. $I = I + 1$, and generate the subset $N^*$ of $N(\theta)$ such that either the
applied move does not belong to the Tabu list or at least one of the aspiration criteria
is satisfied;

Step3. Choose a best solution $\theta \in N^*$ according to the objective function
$C(\cdot)$ of the problem;

Step4. If $C(\theta) < C(\theta^*)$, then set $\theta^* = \theta$. Update the Tabu list
and the aspiration criterion;

Step5. If $\theta^*$ is optimal or equal to the lower bound, then stop. If $C(\theta^*)$
has just been improved, set $I = 0$, and go to Step2;

Step6. If $I <$ Maxiter, go to Step2. Otherwise, $T = T + 1$ and implement the
re-optimization procedure;

Step7. If $T <$ Maxt, set $I = 0$ and go to Step2. Otherwise, stop.
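
The interplay of the two counters and the two bounds Maxiter and Maxt can be illustrated with a small control-loop sketch. It is written under several assumptions and is not the authors' C implementation: `ts_step` stands for one pass through Steps 2 and 3, `reoptimize` stands for the local re-optimization of ISB, both are hypothetical callbacks, the counter is reset whenever the best solution improves, and an improvement found by the re-optimization is kept as the best solution.

```python
def tsisb_loop(theta, cost, ts_step, reoptimize, lower_bound, maxiter, maxt):
    best = theta
    i = t = 0                                  # Step1: counters I and T
    while True:
        i += 1                                 # Step2
        theta = ts_step(theta)                 # Steps 2-3: one tabu search move
        if cost(theta) < cost(best):           # Step4
            best, i = theta, 0                 # Step5: reset I after an improvement
        if cost(best) <= lower_bound:          # Step5: stop at the lower bound
            return best
        if i < maxiter:                        # Step6: keep intensifying
            continue
        t += 1                                 # Step6: diversify
        theta = reoptimize(theta)
        if cost(theta) < cost(best):           # assumption: keep an improvement found here
            best = theta
        if t >= maxt:                          # Step7
            return best
        i = 0


# Toy callbacks (not from the paper): the "search" drifts upward, the
# "re-optimization" jumps to a different region of the integer line.
result = tsisb_loop(
    theta=5,
    cost=abs,
    ts_step=lambda x: x + 1,
    reoptimize=lambda x: x - 10,
    lower_bound=0,
    maxiter=3,
    maxt=2,
)
print(result)  # 0
```

In the toy run the local steps fail to improve for Maxiter = 3 iterations, the jump moves the search to a new region, and from there the lower bound is reached.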



4. Computational Results 

TSISB is implemented in C on a Pentium 166MHz personal computer. The
algorithm has been tested on 88 problem instances of various sizes and hardness levels
provided by OR-Library (http://mscmga.ms.ic.ac.uk/info.html), classified as follows:

(a) Three instances FT6, FT10, FT20 due to Fisher and Thompson with
nxm = 6x6, 10x10, 20x5, and five instances ABZ5-9 due to Adams et al., two with
nxm = 10x10 and three with nxm = 20x15.

(b) Eighty instances of eight different sizes (nxm = 15x15, 20x15, 20x20,
30x15, 30x20, 50x15, 50x20, 100x20) denoted as TD1-80. This class contains
“partially hard” cases selected by Taillard among a large number of randomly generated
instances [18]. The optimal solution is known only for 33 out of the 80 instances.

TSISB is compared with all the latest procedures for which we can find results
(make-span, CPU time) in the literature. The following notations are used for those
procedures: NS(3) stands for the tabu search procedure of Nowicki and Smutnicki, with
the outcome taken over three runs; TD stands for the taboo search procedure of
Taillard [18]; TSSB stands for the tabu search procedure of Pezzella and Merelli [17];
SB-GLS1 and SB-RGLS5,10 stand for three of the twelve guided local search procedures of
Balas and Vazacopoulos [5]; and BV-best stands for the best solution of these 12
procedures.



Table 1. Comparison with TSSB on instances (a)

                                TSISB                  TSSB
Problem  n    m    OPT (LB UB)  UB     RE     Time     UB     RE     Time
FT6      6    6    55           55     0.00   -        55     0.00   -
FT10     10   10   930          930    0.00   200      930    0.00   80
FT20     20   5    1165         1165   0.00   73.2     1165   0.00   115
ABZ5     10   10   1234         1234   0.00   9.0      1234   0.00   75
ABZ6     10   10   943          943    0.00   231      943    0.00   80
ABZ7     20   15   656          665    1.37   2028     666    1.52   200
ABZ8     20   15   (645 669)    671    4.03   2196     678    5.12   205
ABZ9     20   15   (661 679)    686    3.64   2724     693    4.84   195
MRE                                    1.13                   1.44

The best lower bound (LB) for each problem is taken from [17]. The relative error
RE (%) is calculated for each procedure and each instance, i.e. the percentage by
which the solution obtained is above the LB, 100(UB-LB)/LB, and MRE means the
mean relative error. Time stands for computer-independent CPU times that are
based on Dongarra [7], as interpreted by Vaessens et al. [19]. In our experiments,
when $m \le 15$ or $m > 15$, the parameter Maxiter takes the value 8000 or 12000,
respectively; when $m \le 10$, $m = 15$ or $m > 15$, Maxt takes the value 5, 10 or 15,
respectively. The optimal solution and the lower bound used in the stopping criteria are
equal to the LB.
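
As a quick illustration of the RE formula, the snippet below recomputes two TSISB entries of Table 1 from their UB and LB values, and the TSISB MRE from the RE column. The function name `relative_error` is an assumption made for this sketch.

```python
def relative_error(ub, lb):
    """RE (%) = 100 (UB - LB) / LB."""
    return 100.0 * (ub - lb) / lb


# (UB found by TSISB, LB) for ABZ7 and ABZ8, as listed in Table 1
print(round(relative_error(665, 656), 2))  # 1.37
print(round(relative_error(671, 645), 2))  # 4.03

# MRE of TSISB: mean of the RE column of Table 1 over the eight instances
re_tsisb = [0.00, 0.00, 0.00, 0.00, 0.00, 1.37, 4.03, 3.64]
print(round(sum(re_tsisb) / len(re_tsisb), 2))  # 1.13
```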

Table 1 compares TSISB with TSSB on instances (a). This class of instances
includes the notorious FT10 (10x10) due to Fisher and Thompson, and TSISB takes
a quite reasonable time to obtain its optimal solution. Both algorithms find
the optimal solutions of five instances; the exceptions are the three hard instances
ABZ7, 8 and 9, and the optimal solutions of ABZ8 and ABZ9 are not known yet. Because the
parameter Maxt is set to 10, TSISB spends more effort than TSSB on the three instances
ABZ7-9; however, the MRE of TSISB is smaller than that of TSSB.



Table 2. Comparison with the other 7 algorithms on the instances TD1-50

Class    n    m    TD     NS(3)   SB-GLS1  SB-RGLS5  SB-RGLS10  BV-best  TSSB     TSISB
TD1-10   15   15   1.60   2.41    2.24     1.32      1.25       1.16     1.45     1.32
                   -      (203)   (57)     -         -          (1498)   (2175)   (1097)
TD11-20  20   15   4.52   5.46    6.18     4.17      4.00       3.67     4.13     4.04
                   -      (271)   (113)    -         -          (4559)   (2526)   (2232)
TD21-30  20   20   6.67   7.95    8.12     6.70      6.56       6.10     6.52     6.38
                   -      (361)   (165)    -         -          (6850)   (34910)  (6644)
TD31-40  30   15   2.43   3.05    3.53     1.49      1.30       0.79     1.92     1.34
                   -      (407)   (175)    -         -          (8491)   (14133)  (4101)
TD41-50  30   20   6.32   8.34    8.50     5.86      5.73       5.20     6.04     5.70
                   -      (542)   (421)    -         -          (16018)  (11512)  (17784)
MRE                4.31   5.44    5.71     3.91      3.77       2.65     4.01     3.76

The average computing time for each class is given in parentheses; - means not reported.



Next, the 80 instances TD1-80 are computed. Among these instances, about 30
are easy because the number of jobs is several times that of machines [5], and the
other 50, TD1-50, are hard not only because their numbers of jobs and machines are
almost the same but also because of their quite large sizes. The number of operations
of these 50 instances is between 225 and 600, and these instances are divided
into 5 classes according to their sizes. Table 2 gives the MRE for each class and
the MRE over all these instances. Three of the 8 algorithms did not report their
computing time. Not only on the MRE but also on the computing time, TSISB has the best
performance, especially when the number of machines is equal to 20. This is attributed
to several factors, such as the procedure for the initial solution, the diversification
strategy and the new neighborhood in our algorithm.

TSISB has found 28 optimal solutions out of the 30 easy instances, the exceptions being
TD62 and TD67 (nxm = 50x20), whose optimal solutions have not been found yet. However, it
obtains the best solutions of these two instances with make-spans of 2826 and 2879,
respectively. Furthermore, the value of 2879 is the lowest upper bound so far for instance
TD62, whose computing time is 8334 seconds, and the average computing time of
TSISB on the 30 instances is much less than that of both TSSB and BV-best. In
detail, the average computing times for these three classes of instances are TD51-60:
70.6 seconds; TD61-70: 2296 seconds; and TD71-80: 186 seconds.



5. Conclusion

The new heuristic algorithm TSISB, which is based on the TS technique and the
improved shifting bottleneck procedure, turns out to be effective and efficient. It gets
its initial solution with a new procedure, ISB. TSISB implements the TS procedure in a
very natural way and improves the intensification and diversification strategies of TS by
using both the local search procedure of TS and the local re-optimization procedure of
ISB in turn. In the local search procedure of TS, TSISB adopts a new neighborhood
structure and sets the parameters $\delta$, $l$, Maxiter and Maxt in a simple manner.

The computational experiments show that TSISB is better than TSSB. TSISB
performs better than SB-RGLS5,10 on the instances TD1-80. In particular, within a
moderate period, TSISB has found a lower upper bound than them for the instance TD62
with a quite large size of nxm = 50x20. It can be concluded that TSISB is a robust,
effective and efficient algorithm, given its performance on the 88 benchmarks.
Furthermore, other local search procedures and parallel algorithms can benefit from the
way the diversification strategy is implemented in TSISB.

Abstract. There exist a large number of computational resources in the domain
of scientific and engineering computation. They are distributed, heterogeneous
and often too restricted in computing capability to satisfy the requirements
of modern scientific problems on their own. To address this challenge, this paper
proposes a component-based architecture for managing and accessing legacy applications
on the computational grid. It automatically schedules legacies with domain
expertise, and coordinates them to serve large-scale scientific computation. A
prototype has been implemented to evaluate the architecture.



1 Introduction 

It is well known that there exist a large number of legacy applications in almost every
scientific and engineering domain. They are unchangeable and too valuable to be
given up. Each alone is very restricted in computing capability, due to both the target
platform's limitations and the programming complexity. However, some of them are
complementary in function and in the characteristics of the problems they can solve, and
others are compatible. There is little doubt that their aggregation is powerful enough
to solve almost every problem reliably and efficiently. The idea of coordinating these
legacy applications and their target platforms with Grid technologies [1, 2] to solve
large-scale and complex scientific problems is straightforward, and the advantages are
obvious. However, despite the technical advances of Grid computing in recent years,
this kind of coordinative computing remains a grand challenge.

We have attempted to devise an application framework, AOD, for coordinating and
scheduling distributed legacy applications on the computational grid. It
is component-based and built on top of OGSA [2], and it supports the dynamic selection
of distributed but complementary legacies for solving large-scale complex
scientific problems in a cooperative way. Our ultimate goal is to equip the
computational grid with the mechanisms for coordinating and scheduling legacies
automatically, and hence to create an on-demand computing environment. In this
environment, legacies are augmented with domain expertise and abstracted as
consistent services for performing specific computation on the computational grid.



This work was supported by the National Natural Science Foundation of China
(No. 60303001, No. 60173004).

Every service is automatically implemented with a collection of complementary and
competitive legacies. These services are self-optimizing, self-healing and adaptive to
problem characteristics. A grid application is a set of services connected with each
other by directed edges, and AOD provides the mechanisms for executing it on the
computational grid.

The next section presents an approach for automatically managing and accessing 
legacy applications on the computational grid, and discusses the mechanisms for 
coordinating them to solve large-scale scientific problems. Section 3 introduces a 
prototype of AOD. Related works are overviewed in section 4, followed by a 
conclusion of this paper. 

2 A Grid Environment for Large-Scale Scientific Computation 

To manage and access legacy applications on the computational grid, we have
proposed the concept of the grid-programming component, which provides an approach for
incorporating domain expertise into the grid environment. A grid-programming
component (GP component) is an autonomic and extensible entity existing on the
computational grid, aggregating a collection of legacies augmented with the necessary
domain expertise, and providing a set of functions for developing grid applications.
Every function implies some kind of computation with its domain-customized
interface, and is automatically implemented with the legacies. The functions are
self-healing with respect to failures occurring on the computational grid, and
self-optimizing according to problem characteristics and the dynamic statuses of grid
resources. We call every such function an on-demand computing service (OD service) of
the grid-programming component.

Every GP component uses a generic configuration framework to specify its
underlying computational resources and the augmented expertise. When it is
registered, the configuration is interpreted by AOD to configure and implement its
OD services on the computational grid. The configuration declares a list of IO ports
and OD services as the GP component's interface. The IO ports are used by the OD
services to input and output their arguments. Every port inputs or outputs one type of
data object in files, and specifies the syntax and semantics of every transferred file
in domain terms. Generally, an OD service has more than one candidate implementation.
Every candidate is provided by some local platform independently, involving one or
more legacies installed on the platform. Different candidates may differ in efficiency
and in the characteristics of the problems they can solve. The configuration not only
specifies an executing scheme for every candidate's underlying resources, but also
details three kinds of domain expertise for dynamically selecting a candidate. The first
is the annotation of every candidate's applicability to problem characteristics. The
second is the methods for querying the dynamic statuses of every candidate's underlying
resources. The third is the methods for automatically detecting a problem's
characteristics from its data. With this expertise and the support of Grid middleware
like Globus Toolkit [3], an OD service dynamically selects one optimal candidate to
execute and complete the desired computation when it is invoked.
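
The dynamic selection described above can be pictured with a small sketch. It is purely illustrative and does not reflect AOD's actual interfaces: the `Candidate` class, the `select_candidate` function, the host names and the choice of "lowest reported load" as the optimality criterion are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence


@dataclass
class Candidate:
    host: str
    applicable: Callable[[Mapping], bool]   # annotation: which problems it can solve
    load: Callable[[], float]               # query of the dynamic resource status


def select_candidate(candidates: Sequence[Candidate], problem: Mapping) -> Candidate:
    """Pick the applicable candidate whose resources currently report the lowest load."""
    usable = [c for c in candidates if c.applicable(problem)]
    if not usable:
        raise LookupError("no candidate implementation fits this problem")
    return min(usable, key=lambda c: c.load())


# Hypothetical candidates and problem characteristics
candidates = [
    Candidate("hostA", lambda p: p["size"] <= 1000, lambda: 0.2),
    Candidate("hostB", lambda p: True, lambda: 0.7),
]
print(select_candidate(candidates, {"size": 500}).host)   # hostA
print(select_candidate(candidates, {"size": 5000}).host)  # hostB
```

Here the problem characteristics are passed in directly; in the setting described above they would be detected from the problem's data by the third kind of expertise.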

Based on the concept of the GP component, AOD provides a Grid environment for
combining distributed legacy resources dynamically to serve large-scale scientific
computation. Every legacy in AOD is encapsulated in some GP component. In grid
applications, a complex problem is divided into several concurrent and relatively
simple sub-problems, and every sub-problem is specified with a reference to some
OD service. These references are connected with directed edges to specify the
problem domain's concurrency. When the application is submitted to run, AOD
automatically creates a formal sub-problem description for every reference, according
to the application's arguments and the connected edges. The referenced OD services are
invoked concurrently, and each is provided with a formal sub-problem description. AOD is
also responsible for transferring data objects and communicating messages for the
invoked OD services on the computational grid.

AOD consists of a repository, a scheduler and brokers, and is built on top of Grid 
middleware for OGSA. The repository is responsible for configuring and managing 
all registered GP components and their OD services on the computational grid. It also 
provides an environment in which every OD service selects and schedules its underlying 
computational resources. An OD service’s behavior on any underlying host is 
conducted by the local broker instance. The scheduler automatically invokes and 
synchronizes concurrent OD services while an application is executing. 



3 Implementation and Experiment 

Based on the Globus Toolkit, we have implemented a prototype of AOD. In this 
prototype, the scheduler, the repository and every broker instance are each assigned 
a local TCP/IP port, through which they exchange real-time messages with GlobusIO 
in order to invoke OD services, perform computation and synchronize concurrent OD 
services. The local broker instances on every host are managed with GRAM. When an 
OD service is invoked, its arguments are transferred across the computational grid 
with GridFTP. The prototype provides three XML schemas for programmers. The first 
schema is for domain experts to define FC descriptors. The second is for developing 
the configurations of GP components, and the last is for developing grid 
applications. We have also developed a tool for running grid applications from 
Internet browsers. 



Table 1. Experimental Result of a Demonstrative Example 

GP component   computing host     working directory          Start time   End time 
preProc        162.105.203.100    /home/chen/lyan/test1/     16:43:41     16:45:39 
voiFilt        162.105.203.100    /home/chen/lyan/part1/     16:45:47     16:46:58 
qCom           162.105.203.38     /home/aitest/oil/part2/    16:45:47     16:48:16 
Synth          162.105.80.17      /home/globus/lyan/test2/   16:48:31     16:55:26 



Table 1 shows the experimental result of a demonstrative example performed on the 
prototype. The example consists of four GP components, prePrc, voiFilt, qCom and 
Synth, which perform a simplified pre-stack migration for oil-prospecting data 
processing. Every GP component provides one OD service with the corresponding 
legacy application and several additional executables. prePrc accepts the raw 
sampling data, and its result is passed to both voiFilt and qCom. Synth creates the 
final result by synthesizing the results of voiFilt and qCom. The experimental 
sampling data is about 2 GB, consisting of two binary data files. 

4 Related Works and Conclusion 

In recent years, the challenge of developing grid applications has been investigated 
extensively. The OGSA is the first effort to standardize Grid functionality and 
produce a Grid programming model consistent with trends in the commercial sector. 
It integrates Grid and Web services concepts and technologies. In this architecture, 
resources are encapsulated to be Grid services [2, 4] with standard interfaces and 
behaviors. XCAT [5, 6] and ICENI [7] attempt to build an application component 
framework on top of OGSA for distributed computation, and support grid applications 
that require the collaboration of different Grid services. Neither OGSA nor XCAT 
takes account of complementary or competitive resources in resource scheduling. 
ICENI seeks to annotate the programmatic interfaces of Grid services using the Web 
Ontology Language, allowing syntactically different but semantically equivalent 
services to be autonomously adapted and substituted. 

AOD provides a mechanism for scheduling complementary and competitive 
resources universally, according to problem characteristics and dynamic resource 
status. It abstracts distributed and heterogeneous computational resources as 
services that are self-healing with respect to failures occurring on the 
computational grid and adaptive to both problem characteristics and the dynamic 
status of computational resources, and it supports multiple services that 
collaborate to solve complex scientific problems. This resource management strategy 
not only improves the dependability and efficiency of grid computing, but also 
reduces the complexity of developing grid applications. We plan to replace the 
current candidate implementations of OD services with Grid/Web services, so as to 
reduce the complexity of developing GP components. 

1 Introduction 

As Grid and P2P computing become more and more popular, many scheduling 
algorithms based on economics rather than traditional pure computing theory 
have been proposed. Such algorithms mainly concern balancing resource supply 
and resource demand via economic measures. As is well known, fairness and 
efficiency are two conflicting goals. In this paper, we argue that overbooking 
resources can greatly improve resource usage rates and simultaneously reduce the 
response time of tasks by shortening scheduling time, especially under extreme 
overload, while maintaining the scheduling principles. This is accomplished by 
scheduling, in advance, more eligible tasks than the resource capacity allows, 
thereby overbooking the resource. We verify our claim on Grid Market [1]. 

2 Enhance Grid Market with Overbooking 

2.1 Grid Market [1] 

There are two types of participants in the market: resource suppliers and 
resource consumers. Suppliers compete to sell resources while consumers contend 
to buy resources. A market transaction center matches all orders from suppliers 
and consumers based on order type and the match algorithm. Software agents, 
running pricing algorithms on behalf of suppliers and consumers, post orders 
automatically. 

* This project is supported by the National Natural Science Foundation of China 
under Grant No. 60373004, No. 60373005, and No. 60273007, and by the National 
High Technology Development Program of China under Grant No. 2002AA104580. 

The resource market periodically uses a price-driven continuous double auction 
process to match consumers’ bid orders (buy orders) and suppliers’ ask orders 
(sell orders). Bid orders and ask orders can be submitted at any time during 
the trading period. At the end of a match period, if there are open bids and 
asks that match or are compatible in terms of price and requirements, a trade is 
executed. Bids are ranked from highest to lowest price, while asks are sorted 
from lowest to highest price. The match process starts from the beginning of the 
ranked bids and asks. If prices are equal, match priorities are based on the 
principle of time first and quantity first. 
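
The ranking and matching rule above can be sketched as follows. This is a simplified illustration, not Grid Market's code: the `Order` class is hypothetical, each order is treated as a single indivisible unit, requirement compatibility is ignored, and "quantity first" is assumed to mean that the larger quantity wins a remaining tie.

```python
from dataclasses import dataclass


@dataclass
class Order:
    trader: str
    price: float
    time: int       # submission time; earlier orders win price ties
    quantity: int   # assumed tie-breaker after time: larger quantity first


def match_orders(bids, asks):
    """Pair ranked bids with ranked asks while the bid price covers the ask price."""
    ranked_bids = sorted(bids, key=lambda o: (-o.price, o.time, -o.quantity))
    ranked_asks = sorted(asks, key=lambda o: (o.price, o.time, -o.quantity))
    trades = []
    for bid, ask in zip(ranked_bids, ranked_asks):
        if bid.price < ask.price:
            break  # every later pair has a lower bid and a higher ask
        trades.append((bid.trader, ask.trader))
    return trades
```

Orders left unmatched at the end of a period stay open and are re-priced by the agents before the next match.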

Grid Market proposed two pricing algorithms: a consumer pricing algorithm 
and a supplier pricing algorithm. The consumer pricing function is 
$P_{bid}(t) = \alpha + \beta \Delta t$, where $\alpha$, denoting the base price, 
and $\beta$, expressing the price elasticity, are consumer-specific coefficients 
and $t$ is the time parameter. The supplier pricing function is 
$P_{asked}(t) = \alpha - \beta \Delta t$, where $\alpha$, denoting the base price, 
and $\beta$, expressing the price elasticity, are supplier-specific coefficients 
and $t$ is the time parameter. Over time, these two functions make bid prices and 
ask prices converge, which clears the market. 
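
As a worked illustration of how the two linear pricing functions converge, the sketch below computes the time at which a bid and an ask first meet. It assumes that both agents start adjusting at the same instant and that no price ceiling or floor intervenes; the function names are hypothetical.

```python
def bid_price(alpha, beta, dt):
    """Consumer price after dt time units: base price plus elasticity times dt."""
    return alpha + beta * dt


def ask_price(alpha, beta, dt):
    """Supplier price after dt time units: base price minus elasticity times dt."""
    return alpha - beta * dt


def time_to_cross(bid_alpha, bid_beta, ask_alpha, ask_beta):
    """Solve bid_alpha + bid_beta*dt = ask_alpha - ask_beta*dt for dt."""
    if bid_alpha >= ask_alpha:
        return 0.0  # the prices are already compatible
    if bid_beta + ask_beta <= 0:
        raise ValueError("prices never converge without positive elasticity")
    return (ask_alpha - bid_alpha) / (bid_beta + ask_beta)
```

For example, with a bid starting at 8, an ask starting at 12 and both elasticities equal to 0.5, the prices meet at 10 after 4 time units. Smaller elasticities lengthen this negotiation time, which is the idle time that overbooking aims to hide.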

2.2 Overbooking 

Consider the following scenario in a basic Grid Market. A resource is released by 
a consumer and is open to competition among potential consumers. If the 
resource’s ask price is higher than every bid price posted by consumers, the 
resource inevitably stays idle until the ask price and a bid price gravitate 
towards each other and finally meet or cross. The wasted negotiation time may be 
negligible in most lightly loaded cases, but it may cause a severe service 
bottleneck under heavy load. Moreover, the idle time decreases resource 
utilization rates and prolongs the response time of a task, which is the sum of 
negotiation time and service time. 

Overbooking is a widely used technique that can improve resource utilization 
rates. The basic idea is as follows. A resource can temporarily be assigned more 
load than its stated capacity and, using a priority-based algorithm, serves it as 
best it can. The resource thus works at full capacity and none of it is 
squandered. Nevertheless, to maintain service quality, the average load should 
not exceed the resource’s capacity in the long term. 

Therefore, we introduce overbooking into Grid Market. A resource keeps a 
waiting queue in which the successful bidders for it are lined up. A new winning 
consumer is queued when a consumer frees all or part of the resource. 
The length of the waiting queue depends on three parameters. The first is the 
match interval. When matches are frequent relative to service time, a relatively 
short waiting queue is enough to preserve resource usage rates, and it 
effectively shortens the necessary negotiation time. The second is the number of 
a resource’s servants, which determines the number of consumers that can be 
served simultaneously by the resource. The number of candidates in the queue must 
be at least the number of servants of the corresponding resource. Otherwise, the 
fraction of the resource released by served consumers may stay unoccupied until 
succeeding consumers win it. The third is the ratio of negotiation time to 
service time. The higher the ratio, the longer the waiting queue should be to 
keep the pipeline fed. While service time is relatively fixed, negotiation 
time depends on several market-wide factors: consumers’ starting prices, 
elasticities and ceilings, and suppliers’ starting prices, elasticities and 
floors. In common practice, we expect the ratio of the waiting queue’s length 
to the number of servants of a resource to account predominantly for the 
efficiency, and we take 1/1, that is, a queue length equal to the number of 
corresponding servants, as a rough estimate of a suitable length. An adaptation 
policy can be applied here. 
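
The waiting-queue mechanism can be sketched as below. This is an illustrative model only: the class and method names are hypothetical, winners are served in first-come order (the text does not fix the queue discipline), and `queue_ratio=1.0` encodes the 1/1 rule of thumb stated above.

```python
from collections import deque


class OverbookedResource:
    """A resource with a fixed number of servants plus a queue of pre-matched winners."""

    def __init__(self, servants, queue_ratio=1.0):
        self.servants = servants
        self.queue_limit = int(servants * queue_ratio)  # 1/1 ratio by default
        self.serving = set()
        self.waiting = deque()

    def admit(self, consumer):
        """Accept a winning bidder: serve it now if possible, otherwise queue it."""
        if len(self.serving) < self.servants:
            self.serving.add(consumer)
            return True
        if len(self.waiting) < self.queue_limit:
            self.waiting.append(consumer)
            return True
        return False  # resource and queue are full; the consumer keeps bidding

    def release(self, consumer):
        """Free a servant and hand it to the next queued winner without renegotiation."""
        self.serving.discard(consumer)
        if self.waiting:
            successor = self.waiting.popleft()
            self.serving.add(successor)
            return successor
        return None
```

Because the successor was matched before the servant became free, its negotiation time overlaps with the predecessor's service time instead of leaving the servant idle.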

3 Analysis and Simulation 

3.1 Analysis 

The system is modelled as an M/M/N queuing network [6]. The task streams of all 
consumers are merged into a single task stream that forms the system input 
stream. We employ the equations of [6] to analyze theoretically the resource 
utilization rate $\rho$ and the response time $T_{est}$ of our system; they are 
expressed in terms of $N_{consumers}$ and $N_{suppliers}$, the numbers of 
consumers and suppliers respectively. (The equations themselves are illegible in 
this copy.) 
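
For orientation, the textbook M/M/N results (utilization and mean response time via the Erlang C formula) can be computed as in the sketch below. This uses the standard formulas and is not claimed to be the exact form of the equations the authors took from [6]; `arrival_rate` stands for the aggregate task stream of all consumers and `servers` for the number of suppliers, both of which are assumptions about how the model maps onto the market.

```python
from math import factorial


def mmn_metrics(arrival_rate, service_rate, servers):
    """Return (utilization, mean response time) of a standard M/M/N queue."""
    offered = arrival_rate / service_rate      # offered load in Erlangs
    rho = offered / servers                    # per-server utilization
    if rho >= 1:
        raise ValueError("unstable queue: utilization must be below 1")
    head = sum(offered ** k / factorial(k) for k in range(servers))
    tail = offered ** servers / (factorial(servers) * (1 - rho))
    p_wait = tail / (head + tail)              # Erlang C: probability a task must wait
    mean_wait = p_wait / (servers * service_rate - arrival_rate)
    return rho, mean_wait + 1 / service_rate
```

With one server this reduces to the M/M/1 response time 1/(service_rate - arrival_rate), which is a quick sanity check on the implementation.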



3.2 Simulation 

We use an event-driven prototype to explore the scheduling efficiency of this 
algorithm in terms of task response time and resource utilization rate under 
varying elasticity coefficients, which should be the primary determining factor 
in common settings (Figure 1 and Figure 2). To emphasize the algorithm’s 
performance in unfavorable settings, we set the match interval to 2 time units, 
which greatly protracts the negotiation duration. 

First, the figures show that our scheduling algorithm is highly efficient: 
the experimental curves closely approximate the theoretical curves (plotted 
according to $\rho$ and $T_{est}$ respectively) when the system load is not high. 
Second, without overbooking, the time burden due to bargaining between consumers 
and suppliers increases sharply as the system approaches saturation, and the 
degree of the increased burden is negatively related to the elasticity 
coefficients. The reason is straightforward: bargaining time costs are negligible 
relative to the long arrival intervals when the system load is light, but they do 
matter under high load. These costs reduce resource utilization rates and 
increase response time. Third, overbooking greatly improves the algorithm’s 
performance, especially in high-load environments. With overbooking, both the 
response time curve and the resource utilization rate curve approach their 
theoretical curves. Finally, the heavier the time burden incurred by the 
participants’ parameters, the more effective overbooking is. 

4 Related Work 

Spawn [2] employs the Vickrey auction [3] (a second-price sealed-bid auction) to 
allocate resources among bidders. Bidders receive periodic funding and use their 
fund balance to bid for hierarchical resources. A task-farming master program 
spawns and withdraws subtasks depending on its balance relative to its 
counterparts. Monte Carlo simulation applications are its main targets. 
Rexec/Anemone [4] implements proportional resource sharing in clusters. Users 
assign utility values to their applications and the system allocates resources 
proportionally. Cost requirements are not considered. In JaWS (Java Web-computing 
System) [5], machines are assigned to applications via an auction process in 
which the highest bidder wins. None of these solutions makes use of a continuous 
double auction. 

5 Conclusion 

With the spread of Grid and P2P computing, a critical problem arises: 
allocating resources efficiently and fairly, especially under extreme overload. 
In this paper, we contend that overbooking resources can greatly improve usage 
rates without violating algorithm-specific scheduling principles. Simulation 
results obtained on Grid Market enhanced with overbooking support these claims. 

Abstract. The Grid technology enables large-scale sharing and coordinated use 
of networked resources. The kernel of the computational Grid is resource sharing 
and cooperation over a wide area. To achieve better resource sharing and 
cooperation, resource discovery must be efficient. In this paper, we propose a 
Grid resource discovery model that combines flat, fully decentralized P2P 
overlay networks with a hierarchical architecture to yield good scalability and 
routing performance. Our model adapts efficiently when individual nodes join, 
leave or fail. Both the theoretical analysis and the experimental results show 
that our model is efficient, robust and easy to implement. 



1 Introduction 

The kernel of the computational Grid is resource sharing and cooperation over a 
wide area. We propose a Grid resource discovery model that utilizes flat, 
decentralized P2P overlay networks. P2P overlay networks, such as Chord [1], 
CAN [2] and Tapestry [3], are commonly used in file-sharing systems, in which the 
discovery result has to match the request exactly. Resource discovery in the 
Grid, however, lacks a naming scheme. GRIP [4] is used to access information 
about resource providers, while GRRP [4] is used to notify register nodes of the 
availability of this information. To deal with this problem, we combine P2P and 
hierarchical architectures in our model. In our model, nodes in the Grid are 
classified into two types. Register nodes are those that do not provide any 
resource but only manage the nodes that provide resources. The model applies a 
P2P architecture to the register nodes, which makes the register framework more 
scalable than traditional register architectures such as centralized or 
hierarchical registers. Resource nodes are the other nodes, which provide 
resources and take on a small amount of management work. 



2 Constructing Register and Resource Provider P2P Network 

The scalability of a centralized architecture is poor because the register node 
is its bottleneck. In our model, we therefore combine P2P overlay networks and a 
hierarchical architecture. There are two P2P overlay networks. One is the 
register P2P overlay network, which consists of register nodes; the other is the 
resource provider P2P overlay network, which is constructed from resource 
provider nodes. We assume that the IP address is the identifier of a node. We can 
regard an IP address as a point in a virtual 4-dimensional Cartesian coordinate 
space, defined as Sa = {(0, 0, 0, 0), (255, 255, 255, 255)}. We assume the 4 axes 
of Sa are x, y, z and w. The first register node R1 holds the whole space Sa. 
When the second register node R2 joins, Sa is divided into two equal parts. One 
part is controlled by R1 and the other is held by R2. The central point of the 
part controlled by R1 is closer to R1 than that of the other part. R1 records the 
IP of R2 and the space controlled by R2, and R2 records the IP of R1 and the 
space controlled by R1. In this way the neighbor relationship between R1 and R2 
is set up. After the register overlay network contains m nodes {R1, R2, ..., Rm}, 
the (m+1)th register node joins by splitting into two parts the space controlled 
by the node Rn (1 <= n <= m) whose IP is closest to the IP of Rm+1. 
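
The join procedure for register nodes can be sketched as follows. Several details are assumptions, since the text does not specify them: the zone is halved along its longest axis (as in CAN), closeness between IPs is measured as Euclidean distance between the 4-dimensional points, and the existing node keeps the half whose center is nearer to its own IP. The second and third IP addresses in the usage note are made up for illustration.

```python
FULL_SPACE = ((0, 0, 0, 0), (255, 255, 255, 255))  # Sa as (lower corner, upper corner)


def ip_point(ip):
    """Map a dotted IPv4 address to a point in the 4-dimensional space."""
    return tuple(int(part) for part in ip.split("."))


def center(zone):
    lo, hi = zone
    return tuple((l + h) / 2 for l, h in zip(lo, hi))


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def volume(zone):
    lo, hi = zone
    size = 1
    for l, h in zip(lo, hi):
        size *= h - l + 1
    return size


def split_zone(zone):
    """Halve a zone along its longest axis (assumed splitting rule)."""
    lo, hi = zone
    axis = max(range(4), key=lambda i: hi[i] - lo[i])
    mid = (lo[axis] + hi[axis]) // 2
    lower = (lo, hi[:axis] + (mid,) + hi[axis + 1:])
    upper = (lo[:axis] + (mid + 1,) + lo[axis + 1:], hi)
    return lower, upper


class RegisterOverlay:
    def __init__(self):
        self.zones = {}  # register node IP -> zone it controls

    def join(self, ip):
        if not self.zones:
            self.zones[ip] = FULL_SPACE  # the first node holds all of Sa
            return
        point = ip_point(ip)
        owner = min(self.zones, key=lambda o: dist2(ip_point(o), point))
        first, second = split_zone(self.zones[owner])
        owner_point = ip_point(owner)
        # The existing node keeps the half whose center is closer to its own IP.
        if dist2(center(first), owner_point) <= dist2(center(second), owner_point):
            self.zones[owner], self.zones[ip] = first, second
        else:
            self.zones[owner], self.zones[ip] = second, first
```

Joining 28.18.36.112 and then 200.10.20.30 leaves the first node with x in 0..127 and the second with x in 128..255; the zones always tile Sa without overlap.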

We assume P1 is a resource provider node whose IP is 162.146.201.148 and that it 
knows the register node R1 (28.18.36.112). P1 then sends its GRIP data to R1. R1 
checks the IP of P1 against its space and then transfers the GRIP data of P1 to 
its neighbor. The neighbor of R1 does the same as R1, and finally the GRIP data 
is received by R2. After R2 receives the GRIP data, it records the static 
resources and only the types of the dynamic resources, and sends feedback to P1. 
Because dynamic resources change over time, if R2 held the dynamic resources 
themselves it would have to refresh them periodically, which would consume much 
of R2’s resources and result in poor scalability. So we store only dynamic 
resource types in register nodes. The feedback contains the IP of R2, the space 
controlled by P1 (in the 4-dimensional Cartesian coordinate space), the spaces 
controlled by P1’s neighbors, and the static resources and dynamic resource types 
of P1’s neighbors. If P1 has no neighbor, P1 holds the space controlled by R2. 
P1 then sends its GRIP data to its neighbors, and its neighbors record the 
dynamic resources of P1. P1 periodically sends messages to its neighbors to 
refresh their records of its dynamic resources. Thus at most 9 nodes know the 
dynamic resources of P1. At this point P1 has registered successfully and has 
joined the resource provider P2P overlay network. In this way, we can construct 
the resource provider P2P overlay network. 



3 The Process of Resource Discovery 

If a client c knows any node in the Grid, it can obtain at least one register 
node from that node. Then c sends a request to register node R1 to obtain some 
resource. After receiving the request, R1 checks whether the space it controls 
contains the resource that c is requesting. If the space contains the static 
resource that c is asking for, R1 tells c that the static resource has been found 
and sends the location of the resource to c. Otherwise R1 forwards the request to 
its neighbors and waits for their responses. If one of its neighbors has the 
resource, R1 selects that neighbor and sends its IP to c; c then resends the 
request to the selected neighbor of R1 to ask for the resource. If none of the 
neighbors of R1 has the resource, R1 extends the search scope so that more 
register nodes check their resources, until at least one register node Rn finds 
the resource and a resource provider node Pn, which belongs to the space 
controlled by Rn, can provide it. 

If R1 has the dynamic resource that c is asking for, it randomly selects a 
resource provider P1 that provides the resource and whose maximum amount of the 
resource is not smaller than the amount c requests; R1 then sends the IP of P1 to 
c. After c receives the feedback, it sends a message to P1 to check the current 
load of the resource. If the free resource satisfies the request of c, P1 accepts 
the request and allocates the free resource to c. Otherwise P1 uses the 
experience-based + random algorithm to forward the request of c to one of its 
neighbors (Fig. 1-a). The experience-based + random algorithm works as follows: 
nodes learn from experience by recording the requests answered by other nodes. A 
request is forwarded to the peer that answered similar requests previously. If no 
relevant experience exists, the request is forwarded to a randomly chosen node. 
If R1 does not have the dynamic resource that c is asking for, it proceeds as in 
static resource discovery to find a register node Rn that contains the dynamic 
resource and whose maximum amount of the resource is not smaller than the amount 
c requests (Fig. 1-b). 
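
The experience-based + random forwarding rule can be sketched as below. The sketch makes two simplifying assumptions: a "similar" request is taken to mean a request for the same resource type, and only the most recent answering neighbor is remembered. The class and method names are hypothetical.

```python
import random


class ProviderNode:
    """A resource provider that forwards requests it cannot satisfy itself."""

    def __init__(self, name, neighbors, rng=None):
        self.name = name
        self.neighbors = list(neighbors)
        self.experience = {}  # resource type -> neighbor that answered such a request
        self.rng = rng or random.Random()

    def record_answer(self, resource_type, neighbor):
        """Learn from experience: remember who answered a request of this type."""
        self.experience[resource_type] = neighbor

    def next_hop(self, resource_type):
        """Prefer the neighbor with relevant experience; otherwise pick one at random."""
        known = self.experience.get(resource_type)
        if known in self.neighbors:
            return known
        return self.rng.choice(self.neighbors)
```

Checking that the remembered peer is still a neighbor guards against stale experience after a node leaves or fails.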





Fig. 1. The process of dynamic resource discovery 



4 Experimental Results 

In our experiment, we use GT-ITM models to obtain 2 groups of nodes. One group 
contains 5000 nodes that are used as resource providers and the other group 
contains 100 nodes that serve as registers. Our simulator has 20 kinds of static 
resources and 50 kinds of dynamic resources. Each kind of static resource has 10 
instances, and so does each kind of dynamic resource. These resources are 
allocated randomly to resource providers. 

We investigate the influence of the number of nodes on the number of hops. We 
activate 1000, 2000, 3000, 4000 and 5000 resource providers respectively. We 
randomly select 40 resource providers as clients to send requests. The 40 
resource providers are divided into 4 groups. Fig. 2 shows that the number of 
hops increases slightly as the number of computing nodes increases. However, 
there are still some slight irregularities in the curves, because the resource 
that a client searches for may be on its local node, on a neighbor or on some 
nearby node. The four curves are very similar, which shows that our model is 
stable. 

Fig. 2. Average number of hops per group for different resource provider numbers 



5 Conclusions 

We propose a Grid resource discovery model that combines flat, fully 
decentralized P2P overlay networks with a hierarchical architecture to yield good 
scalability and routing performance. Register nodes are organized as a P2P 
overlay network, which removes the single point of failure and improves 
scalability. The register overlay network does some auxiliary management work for 
the resource provider overlay network, which improves routing performance. Both 
the theoretical analysis and the experimental results show that our model is 
efficient, robust and easy to implement. 

Abstract. This paper gives a brief introduction to a collaborative process 
execution mechanism for service composition implemented in the StarWebService 
system. The main idea is to partition the global process model of a composite 
service into local process models for each participant, so as to enable 
distributed control and direct data exchange between participants. It is a novel 
solution to the scalability and autonomy issues of centralized execution. 



1 Introduction 

With the growing trend of service-oriented computing, the composition of web 
services has received much interest as a way to support dynamic inter-enterprise 
application integration. Many research plans and projects [1,2,3] address this 
attractive theme. 

A main approach to realizing service composition is to orchestrate the 
constituent services according to a high-level business process model. Current 
strategies mainly follow workflow technology and propose a centralized 
composition engine to execute the process model [1]. However, we believe that 
service composition targets large-scale information systems in open environments, 
which demands that execution efficiency not decrease as the number of participant 
services, the exchanged data volumes and the number of concurrent requests 
increase. A centralized engine has inherent limits in scalability. Moreover, 
centralization conflicts with the autonomy of business partners and prevents them 
from handling exceptions in an autonomous manner. 

In this paper, we present the collaborative process execution mechanism provided 
by StarWebService that allows a composite service to be carried out on multiple 
nodes. By virtue of distributed scheduling control and direct communications, better 
scalability can be achieved, as well as autonomy of participating services. 



2 StarWebService Overview 

StarWebService is a research project of the StarMiddleware Group [5] which 
provides a service-oriented integration infrastructure for enterprise 
applications. It bridges the gap between traditional middleware technologies and 
web services seamlessly, and enables flexible and scalable inter-enterprise 
integration by composing services into high-level business processes. 

The primary components of the StarWebService system include: the web service 
runtime environment based on a Bus-Container-Service architecture; bi-directional 
application gateways for exporting various middleware-based enterprise resources 
(such as CORBA objects and EJBs) as web services, as well as importing web 
services as specific middleware resources; and a collaborative service 
composition engine for cross-organizational integration. The first two are 
omitted from this paper for lack of space. 



3 Collaborative Process Execution 

The overall concept is a strategy of “divide and collaborate”. The process model 
of a composite service is partitioned into several fragments and deployed to 
several peer composition engines. At run time, those engines collaborate with 
each other and schedule the composite service in a decentralized manner. 

3.1 Global Process Model Vs. Local Process Model 

Generally speaking, the process model of a composite service captures the control 
flows and data flows among activities, which map to invocations of different 
constituent services. It is a logically centralized view of the contract between 
participants (including all the constituent services and the composite service 
itself). So we call it the global process model (GPM) of the composite service. 

Ideas of partitioning a business process to enable distributed execution have 
been proposed in [2,3]. But they divide the process model into fragments 
consisting of only one activity (task), and then distribute them over a set of 
selected nodes responsible for carrying them out. We argue, however, that a 
business process usually involves complex conversations with participants who 
carry out more than one activity. So we adopt a relatively coarse-grained policy 
and group together all the activities and interactions of one participant. The 
result is also represented as a process, but it consists only of the activities 
that are assigned to that specific participant. We call it the local process 
model (LPM) of the participant in the composition. A composition involving N 
constituent services can be decomposed into N+1 LPMs. 

Compared with the GPM, an LPM concerns only the behavior of a specific 
participant in the context of a composition, such as whom it talks to, when it 
performs its operations, and how it exchanges messages. The logical relations 
among activities in the LPMs accord with those in the GPM. In fact, an LPM can be 
understood as the projection of the GPM onto a specific participant. 

An algorithm can be developed to derive LPMs from a GPM. Its main idea is to 
partition the activities into several subsets according to their performers and 
then to relate them by analyzing their causal and data dependencies in the GPM. 
Explicit activities are added to represent communication between LPMs if two 
activities are data-dependent in the GPM but belong to different LPMs. Sec. 4 
gives an example of a GPM and the generated LPMs. Details of the algorithm will 
be presented in other papers. 
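
Since the authors defer the algorithm to other papers, the following is only a rough sketch of the idea stated above, not their method. It assumes each LPM is a simple ordered list of activities (parallel flows are ignored), and the `send(...)` and `receive(...)` labels are hypothetical names for the explicit communication activities.

```python
def partition(activities, data_deps):
    """Project a GPM onto its participants.

    activities: dict mapping activity name -> performer, in GPM order.
    data_deps:  list of (producer, consumer) activity pairs from the GPM.
    Returns a dict mapping performer -> ordered list of activities (its LPM).
    """
    lpms = {}
    for name, performer in activities.items():
        lpms.setdefault(performer, []).append(name)

    for producer, consumer in data_deps:
        src, dst = activities[producer], activities[consumer]
        if src == dst:
            continue  # dependency stays inside one LPM; no message needed
        # Add explicit communication activities on both sides of the boundary.
        send_at = lpms[src].index(producer) + 1
        lpms[src].insert(send_at, f"send({producer}->{dst})")
        recv_at = lpms[dst].index(consumer)
        lpms[dst].insert(recv_at, f"receive({producer}<-{src})")
    return lpms
```

Applied to a fragment of the purchase order example, the shipper's LPM receives the order before deciding on the shipper and sends the shipping information to the invoice service afterwards.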

3.2 Service Composition Engine 

The service composition engine in StarWebService consists of the following parts: 

- Partition module implements the algorithm that divides the GPM of a 
composite service into LPMs for its participants. 

- Deploy module distributes a process model to composition engine in a target 
node and deploys it in the service runtime environment for execution. 

- Execution module manages the execution of processes with an event-driven 
mechanism and maintains contexts for process instances. 

Note that deployed processes are wrapped as services with published WSDL 
documents and can be invoked through SOAP messages. Furthermore, the partition 
module and the deploy module are also implemented as special services deployed in 
the system to enable communication between composition engines. 

3.3 Collaborative Execution with StarWebService 

Instead of executing the GPM in a centralized engine, composite service execution 
with StarWebService involves multiple nodes, each of which executes an LPM of the 
composition. The whole procedure consists of two phases. 

In the preparation phase, the GPM of the composite service is pre-processed by 
the partition module and LPMs for every participant are generated. After the 
concrete constituent service providers for each participant are decided, their 
deploy modules are invoked to transmit and deploy the LPMs to their composition 
engines. The LPM of the composite service itself is deployed at the node that 
wishes to host it, which is called the master node; the others are called 
participant nodes. The composite service is now ready for execution. 

During execution, the master node reacts to customers’ requests and initiates the 
composite service. The execution module in the master node then loads its LPM and 
schedules the activities defined in it. In most cases, the master node does 
little more than invoke the participants’ local processes and wait for their 
termination. Sometimes callback interfaces are necessary. On the participant 
nodes, LPMs are activated in response to invocations from the master node. The 
actions taken on both master and participant nodes, and the communication among 
them, are strictly governed by their LPMs, which ensures that their behaviors are 
compatible with each other. Thus, each of the involved engines schedules a part 
of the activities in the composition and collaborates with the others through 
peer-to-peer communication. 



4 An Example 

We refer to the purchase order process example in the BPEL4WS specification [4] 
to demonstrate our idea. It is composed of an invoice service (IS), a shipper 
service (SS) and a production scheduling service (PSS), with the GPM depicted on 
the left of Fig. 1. The LPMs for the 4 participants in the composition are shown 
on the right. Take the LPM of the shipper for instance. It specifies the 
activities that happen at the shipper site. After the order arrives, it requests 
the shipper information in order to decide on the shipper. Then it notifies the 
invoice service of the shipping information in parallel with the logistics 
arrangement. Finally, the shipping schedule is sent to the production scheduling 
service to end the process. 

Both the GPM and the LPMs can be specified with existing business process 
modeling languages. BPEL4WS is adopted by StarWebService for process definition 
and engine implementation. Fig. 2 shows part of the BPEL definition of the 
shipper’s LPM. 








[Figure 1 panels: a: GPM for purchase order process; b: LPM for purchase order 
process; c: LPM for invoice service; d: LPM for shipper service; e: LPM for 
production scheduling service] 

Legend: 
1: Receive Purchase Order 
2: Initial Price Calculation (IS) 
3: Decide on Shipper (SS) 
4: Initialize Production Scheduling (PSS) 
5: Complete Price Calculation (IS) 
6: Arrange Logistics (SS) 
7: Complete Product Scheduling (PSS) 
8: Invoice Processing 
S1/S2/S3: send PO to invoice/shipper/production schedule 
R1/R2/R3: receive PO 
S4: send Shipinfo to invoice 
R4: receive Shipinfo 
S5: send ShipSchedule to production scheduling 
R5: receive ShipSchedule 

Fig. 1. GPM and LPMs of the purchase order process 



<sequence> 
  <receive partner="purchaseOrderProcess" 
           portType="purchaseOrderProcessPT" 
           operation="sendPO" container="PO"/> 
  <assign> ... </assign> 
  <invoke partner="shippingProvider" 
          portType="shippingPT" 
          operation="requestShipping" 
          inputContainer="shippingRequest" 
          outputContainer="shippingInfo"/> 
  <flow> 
    <invoke partner="invoiceProcess" 
            portType="invoiceProcessPT" 
            operation="sendShippingPrice" 
            inputContainer="shippingInfo"/> 
    <sequence> 
      <receive partner="shippingProvider" 
               portType="invoiceProcessPT" 
               operation="sendSchedule" 
               container="shippingSchedule"/> 
      <invoke partner="productionScheduleProcess" 
              portType="productionScheduleProcessPT" 
              operation="sendShippingSchedule" 
              container="shippingSchedule"/> 
    </sequence> 
  </flow> 
</sequence> 



Fig. 2. Part of BPEL4WS definition for the shipper’s LPM 



5 Conclusions 

We propose a collaborative process execution mechanism for service composition in 
this paper. By distributing the control to multiple nodes and enabling direct 
communication between distributed composition engines, it provides a novel solution 
to scalable and autonomous inter-enterprise integration.



Abstract. Without assuming any knowledge of the underlying physical topology,
conventional P2P mechanisms are designed to choose logical neighbors randomly,
causing a serious topology mismatch problem between the P2P
overlay network and the underlying physical network. This mismatch problem
places great stress on the Internet infrastructure and restrains the
performance gains from the various search or routing techniques. To alleviate
the mismatch problem and reduce unnecessary traffic and response time,
we propose two schemes, namely the location-aware topology matching (LTM)
and scalable bipartite overlay (SBO) techniques. Both LTM and SBO achieve
the above goals without introducing any noticeable extra overhead. Moreover,
both techniques are scalable because the P2P overlay networks are constructed
in a fully distributed manner in which global knowledge of the network is not
necessary. This paper demonstrates the effectiveness of LTM and SBO, and
compares the performance of these two approaches through simulation studies.



1 Introduction 

As an emerging model of communication and computation, peer-to-peer systems are
currently under intensive study [6, 10, 12, 15, 16]. This paper focuses on unstructured
P2P systems, such as Gnutella [2] and KaZaA [4], since they are the most commonly
used in today’s Internet. File placement in these systems is random and has no
correlation with the network topology. The typical search mechanism blindly
“floods” a query through the network among peers (as in Gnutella) or among
super nodes (as in KaZaA). The query is broadcast and relayed until a certain
criterion is satisfied. If a queried peer can provide the requested object, a response
message is sent back to the source peer along the inverse of the query path. The
flooding mechanism ensures that query messages reach as many peers as possible
within a short period of time in a P2P overlay network.

Studies in [15] and [14] have indicated that P2P systems, such as FastTrack (including
KaZaA and Grokster) [1], Gnutella, and DirectConnect, contribute the largest
portion of Internet traffic. A considerable portion of this P2P traffic
is caused by the inefficient overlay topology and blind flooding, which also
makes unstructured P2P systems far from scalable [13].

To alleviate the mismatch problem, reduce unnecessary traffic, and
address the limits of existing solutions, we propose the location-aware topology
matching (LTM) and scalable bipartite overlay (SBO) schemes. In LTM, each peer
issues a detector in a small region so that the peers receiving the detector can record
relative delay information. Based on the delay information, a receiver can detect and
cut most of the inefficient and redundant logical links, and add closer nodes as its
direct neighbors. SBO takes another approach, in which Gnutella-like peer-to-peer
overlays are optimized by disconnecting redundant connections and choosing physically
closer nodes as logical neighbors. Our simulation studies show that both LTM and SBO
significantly reduce the total traffic and the response time of queries
without shrinking the search scope.

The rest of the paper is organized as follows. Section 2 introduces related work.
Section 3 discusses unnecessary traffic and the topology mismatch problem. Section 4
outlines the designs of the LTM and SBO schemes. The simulation and performance
evaluation of LTM and SBO are presented in Section 5, and we conclude our work in
Section 6.



2 Related Work 

Many efforts have been made to avoid the large volume of unnecessary traffic incurred
by flooding-based search in decentralized unstructured P2P systems. In
general, three types of approaches have been proposed to improve search efficiency
in unstructured P2P systems: forwarding-based, cache-based, and overlay optimization.
These three approaches are not mutually exclusive and can be combined to
achieve better results.

In forwarding-based approaches, instead of passing the query message on to all
logical neighbors except the incoming one, a peer selects a subset of its neighbors to relay the
query. The second approach is cache-based search, which includes data index caching
and content caching. Centralized P2P systems provide centralized index servers to
keep indices of the shared files of all peers. KaZaA uses cooperative super peers, each
of which is an index server for a subset of peers. Some systems distribute the function
of keeping indices to all peers [11].

The third search strategy is overlay topology optimization, which inspires the work
we present in this paper. End system multicast, Narada, proposed in [7], constructs
shortest-path spanning trees on top of a richly connected graph. Each tree, rooted
at the corresponding source, employs the well-known DVMRP routing algorithm.
Narada has proven to be a sound overlay system when the number of participants is
not large. However, because its system overhead is exponential in the size of
the multicast group, it is not suitable for P2P systems, which are normally very dynamic
and involve a large number of nodes spread across wide-area networks. Recently,
researchers in [17] have proposed measuring the latency between each peer and multiple
stable Internet servers called “landmarks”. The measured latency can then be
used to determine the distance between peers. This measurement is conducted in a
global P2P domain. In contrast, we choose a completely distributed approach in which
distance measurement is managed in many small regions. As a result, our schemes
can significantly reduce the network traffic while retaining high accuracy.



3 Unnecessary Traffic and Topology Mismatch 

In a P2P system, all participating peers form a P2P network over a physical network.
The maintenance and search operations of a Gnutella peer are described in [3]. When
joining a P2P network, a new peer obtains the IP addresses of a list of existing
peers from a bootstrapping node. It then attempts to connect to these peers as
their neighbor. Once the new peer is connected to the P2P network, it
periodically pings its network connections to obtain the IP addresses of some other peers
in the network. Unfortunately, the join mechanism specified in a P2P network, the
dynamics of peer membership, and the nature of flooding result in a mismatched
overlay network structure and thus incur a large amount of unnecessary
traffic [12].



An example of topology mismatch is illustrated in Fig. 1, where solid lines represent
the underlying physical connections and dotted lines denote the overlay connections
in a Gnutella-like P2P system. For a query message sent along the overlay path
A→C→B, node B is visited twice. Although B is a peering node, B is first visited as
a non-peering node when A tries to reach C. Because of the mismatch problem, the
same message may traverse the same physical links, such as BE, EF and FC in Fig. 1,
multiple times, causing a large amount of unnecessary traffic and increasing the P2P
users’ query search latency as well.

To quantitatively evaluate how serious the topology mismatch problem is in
Gnutella-like networks, we simulate 1,000,000 queries on different Gnutella-like
topologies with an average number of neighbors of 4, 6, 8 and 10. In this simulation,
we track the response to each query message to check whether the response comes back
along a mismatched path. We count a path as a mismatched path if a peering node on
the path has been visited more than once. The results show that more than 70% of the paths
suffer from the topology mismatch problem.
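
The mismatched-path criterion above can be sketched in a few lines of Python. This is only a minimal illustration, not the simulator used for the experiments: the function name `is_mismatched_path`, the `physical_route` mapping, and the concrete routes (chosen to agree with the description of Fig. 1) are assumptions introduced here.

```python
from collections import Counter


def is_mismatched_path(overlay_path, physical_route, peers):
    """Return True if some peering node is visited more than once.

    overlay_path   -- peers in the order the message traverses the overlay
    physical_route -- hypothetical map (u, v) -> physical nodes from u to v
    peers          -- set of nodes that take part in the overlay
    """
    visited = [overlay_path[0]]
    for u, v in zip(overlay_path, overlay_path[1:]):
        # Skip the first node of each route; it was already recorded.
        visited.extend(physical_route[(u, v)][1:])
    visits = Counter(node for node in visited if node in peers)
    return any(count > 1 for count in visits.values())


# Assumed physical routes, consistent with the Fig. 1 description.
PEERS = {"A", "B", "C"}
ROUTES = {
    ("A", "C"): ["A", "B", "E", "F", "C"],
    ("C", "B"): ["C", "F", "E", "B"],
    ("A", "B"): ["A", "B"],
    ("B", "C"): ["B", "E", "F", "C"],
}

mismatched = is_mismatched_path(["A", "C", "B"], ROUTES, PEERS)
matched = is_mismatched_path(["A", "B", "C"], ROUTES, PEERS)
```

With these assumed routes, the overlay path A→C→B passes through B twice, whereas A→B→C visits each peering node once.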




Fig. 1. An example of topology mismatch problem

We also make the following observations from the simulation. First, a query may
be flooded along multiple paths that merge at the same peer. Second, two neighboring
peers may forward the same query message to each other before either receives it
from the other. In both cases, redundant query messages are generated even
among logical links.

Existing studies on overlay optimization connect physically closer nodes as overlay
neighbors using different techniques. However, these kinds of approaches may
destroy the connectivity of the overlay and thus create many isolated islands in the
P2P system. Therefore, they are not feasible in unstructured P2P systems.



4 LTM and SBO

Optimizing inefficient overlay topologies can fundamentally improve P2P search 
efficiency. In this section, we present our solutions, LTM and SBO. 



4.1 LTM 

If the system can detect and disconnect low productive logical connections and
switch the connection AC to AB, as shown in Fig. 1, the total network traffic can
be significantly reduced without shrinking the search scope of queries. This is the
basic principle of our proposed location-aware topology matching technique [8].
Location-aware topology matching consists of three operations: TTL2-detector flooding,
low productive connection cutting, and source peer probing.

Based on the Gnutella 0.6 P2P protocol, we design a new message type called TTL2-detector.
In addition to Gnutella’s unified 23-byte header for all message types, a
TTL2-detector message has a message body in one of two formats. The short format is used
by the source peer and contains the source peer’s IP address and the timestamp at which it
floods the detector. The long format is used by a one-hop peer, that is, a direct neighbor
of the source peer, and includes four fields: Source IP Address, Source Timestamp,
TTL1 IP Address, and TTL1 Timestamp. The first two fields contain the source IP address
and the source timestamp obtained from the source peer. The last two fields are the IP
address of the source peer’s direct neighbor that forwards the detector and the timestamp
at which it forwards the detector. In the message header, the initial TTL value is 2. The
payload type of the detector can be defined as 0x82.
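
As a rough illustration of the two body formats, the sketch below packs a TTL2-detector in Python. The header layout (16-byte message ID, then one byte each for payload type, TTL and hops, then a 4-byte little-endian payload length) follows the standard Gnutella descriptor header. The body field widths (4-byte IPv4 address, 4-byte millisecond timestamp) and all function names are assumptions made here, since the text does not specify them.

```python
import socket
import struct

PAYLOAD_TTL2_DETECTOR = 0x82
# Message ID, payload type, TTL, hops, payload length: 16+1+1+1+4 = 23 bytes.
HEADER_FORMAT = "<16sBBBI"


def pack_header(message_id, ttl, hops, payload_length):
    """Build the 23-byte header for a TTL2-detector message."""
    return struct.pack(HEADER_FORMAT, message_id, PAYLOAD_TTL2_DETECTOR,
                       ttl, hops, payload_length)


def pack_short_body(source_ip, source_ts):
    """Short format: source IP address and source timestamp (assumed widths)."""
    return socket.inet_aton(source_ip) + struct.pack("<I", source_ts)


def pack_long_body(source_ip, source_ts, ttl1_ip, ttl1_ts):
    """Long format: the short-format fields plus the one-hop peer's fields."""
    return (pack_short_body(source_ip, source_ts)
            + pack_short_body(ttl1_ip, ttl1_ts))


short_body = pack_short_body("10.0.0.1", 1000)
long_body = pack_long_body("10.0.0.1", 1000, "10.0.0.2", 1030)
# The source floods the short format with the initial TTL of 2.
header = pack_header(bytes(16), ttl=2, hops=0, payload_length=len(short_body))
```

A one-hop peer would decrement the TTL, increment the hop count, and replace the short body with the long one before forwarding.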

Each peer floods a TTL2-detector periodically. We use d(i, S, v) to denote the
TTL2-detector that has message ID i and TTL value v and is initiated by S.
We use N(S) to denote the set of direct logical neighbors of S, and N²(S) to denote
the set of peers two hops away from S. A TTL2-detector can only reach
peers in N(S) and N²(S). We use the network delay between two nodes as the metric for
measuring the cost between nodes. The clocks in all peers can be synchronized by
current techniques with acceptable accuracy¹. By using the TTL2-detector message, a
peer can compute the cost of the paths to a source peer, and optimize the topology by
conducting the low productive connection cutting and source peer probing operations.
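
Because the clocks are assumed to be synchronized, a peer that receives a long-format detector can split the path from the source into two measured hops. The sketch below shows only this cost computation; the `LongDetector` class and `path_costs` function are hypothetical names, and the cutting and probing decisions themselves are described in [8] and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class LongDetector:
    source_ip: str
    source_ts: int  # ms, stamped by the source peer S
    ttl1_ip: str
    ttl1_ts: int    # ms, stamped by the forwarding one-hop peer


def path_costs(received):
    """Map each one-hop peer to (cost S->N1, cost N1->P, total cost).

    received -- list of (LongDetector, receive_ts) pairs seen by peer P
    """
    costs = {}
    for detector, receive_ts in received:
        first_hop = detector.ttl1_ts - detector.source_ts
        second_hop = receive_ts - detector.ttl1_ts
        costs[detector.ttl1_ip] = (first_hop, second_hop,
                                   first_hop + second_hop)
    return costs


# Two detectors from the same source, relayed by different one-hop peers.
received = [
    (LongDetector("10.0.0.1", 1000, "10.0.0.2", 1030), 1045),
    (LongDetector("10.0.0.1", 1000, "10.0.0.3", 1012), 1020),
]
costs = path_costs(received)
cheapest_relay = min(costs, key=lambda ip: costs[ip][2])
```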



4.2 SBO 

Instead of flooding queries to all neighbors, SBO employs an efficient strategy to
select query forwarding paths and logical neighbors [9]. The topology construction
and optimization of SBO consist of four phases: bootstrapping a new peer, neighbor
distance probing and reporting, forwarding connections computing, and direct
neighbor replacement.

Phase 1: bootstrapping a new peer. When a new peer joins the P2P system, it
randomly takes an initial color: red or white. A peer keeps its color until it
leaves, and again randomly selects a color when it rejoins the system. Thus, each peer
has a color associated with it, and all peers are separated into two groups, red and
white. In SBO, a bootstrap host provides the joining peer with a list of active peers
together with their color information. The joining peer then tries to create connections
to the peers of the other color in the list. In this way, all the peers form a bipartite
overlay, in which a red peer has only white peers as its direct neighbors, and vice versa.

Phase 2: neighbor distance probing and reporting by white peers. We use the network
delay between two peers as the metric for measuring the traffic cost between
peers. We modify the LimeWire implementation of the Gnutella 0.6 P2P protocol [3] by
adding one routing message type for a peer to probe the link cost to its neighbors.
Each white peer sends this message only to its immediate logical neighbors,
forms a neighbor cost table, and sends this table to all its red neighbors.






Fig. 2. An example of SBO operations 



¹ The current public-domain implementation of NTP version 4.1.1 can reach a synchronization
accuracy of 7.5 milliseconds [5]. Another approach is to use distance to measure the
communication cost, such as the number of hops weighted by individual channel bandwidth.

Phase 3: forwarding connections computing by red peers. Based on the obtained
neighbor cost tables, each red peer, such as P in Fig. 2(b), can build a minimum
spanning tree (MST). Since a red peer builds an MST within a two-hop diameter, a white
peer does not need to build an MST. The thick lines in the MST are selected as
forwarding connections (FC), while the thin lines are non-forwarding connections
(NFC). Queries are forwarded only along the FCs.

Phase 4: direct neighbor replacement by white peers. After phase 3, in which an MST
within a two-hop distance is constructed, a red peer P is able to send its queries to all
the peers within this range. Some white peers become non-forwarding neighbors,
such as E in Fig. 2. In this case, for peer E, P is no longer its neighbor. In the phase of
direct neighbor replacement, a non-forwarding neighbor, E, will try to find another
red peer two hops away from P to replace P as its new neighbor.
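
Phase 3 can be illustrated with a small sketch in which a red peer runs Prim's algorithm over the link costs it has collected and then classifies its own links. The topology and costs below are invented for illustration and are not those of Fig. 2. The choice of Prim's algorithm and the function name are also assumptions, since the text only states that an MST is built.

```python
import heapq


def minimum_spanning_tree(edges, root):
    """Prim's algorithm. edges maps frozenset({u, v}) -> link cost."""
    adjacency = {}
    for pair, cost in edges.items():
        u, v = tuple(pair)
        adjacency.setdefault(u, []).append((cost, v))
        adjacency.setdefault(v, []).append((cost, u))
    visited = {root}
    heap = [(cost, root, v) for cost, v in adjacency.get(root, [])]
    heapq.heapify(heap)
    tree = set()
    while heap:
        cost, u, v = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)
        tree.add(frozenset((u, v)))
        for next_cost, w in adjacency.get(v, []):
            if w not in visited:
                heapq.heappush(heap, (next_cost, v, w))
    return tree


# Hypothetical two-hop neighborhood of red peer P: white neighbors E and F,
# and a red peer G two hops away. Costs are delays in milliseconds.
edges = {
    frozenset(("P", "E")): 10,
    frozenset(("P", "F")): 2,
    frozenset(("F", "G")): 3,
    frozenset(("E", "G")): 4,
}
forwarding = minimum_spanning_tree(edges, "P")
# Direct links of P that are not in the MST are non-forwarding connections.
non_forwarding = sorted(
    next(iter(pair - {"P"}))
    for pair in edges
    if "P" in pair and pair not in forwarding
)
```

In this toy example the costly link P-E is left out of the tree, so E becomes a non-forwarding neighbor of P and would look for a replacement red neighbor in Phase 4.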



5 Performance Evaluation 

To evaluate the effectiveness of LTM and SBO, we generate both physical network
topologies and logical topologies in our simulation. The physical topology should
represent the real topology with Internet characteristics. The logical topology
represents the overlay P2P topology built on top of the physical topology. All P2P nodes
are in a subset of the nodes in the physical topology.

In our first simulation, we study the effectiveness of LTM and SBO in a static
P2P environment in which the 8,000 peers do not join or leave the system. Figures 3
and 4 show the traffic cost reduction of LTM and SBO, respectively. In these figures,
the curve ‘c_n-neigh’ shows the average traffic cost caused by a query to cover the
whole network, where c_n denotes the average number of logical neighbors. We can
see that the traffic cost decreases when LTM and SBO are conducted multiple times.
They both reach a threshold after several steps of optimization. LTM reduces
traffic cost by around 80-85%, while SBO reduces traffic cost by between 85% and 90%.
However, LTM converges in around 2-3 steps, while SBO needs 4-5 steps. The simulation
results in Fig. 5 and Fig. 6 show that LTM reduces response time by more than
60% in 3 steps, but SBO needs 8 steps to reduce the response time by 60% in a static
environment.




Fig. 3. Traffic reduction vs. optimization step in LTM

Fig. 4. Traffic reduction vs. optimization step in SBO

Fig. 5. Average response time vs. optimization step in LTM

Fig. 6. Average response time vs. optimization step in SBO

Fig. 7. Average traffic cost comparison of LTM and SBO in a dynamic P2P environment

Fig. 8. Average response time comparison of LTM and SBO in a dynamic P2P environment

P2P networks are highly dynamic, with peers joining and leaving frequently. The
observations in [15] have shown that over 20% of the logical connections in a P2P system
last 1 minute or less, and around 60% of the IP addresses stay active in FastTrack for
no more than 10 minutes each time after they join the system. We therefore further evaluate
the effectiveness of LTM and SBO in dynamic P2P systems. In this simulation, we
assume that the average peer lifetime in a P2P system is 10 minutes and that each peer
issues 0.3 queries per minute. Fig. 7 shows the average traffic cost per query of
Gnutella-like P2P systems, LTM-enabled Gnutella, and SBO-enabled Gnutella. Here
the traffic cost includes all the overhead needed in the optimization steps. SBO and
LTM reduce the average cost by 85% and 80%, respectively. Fig. 8 plots the average
query response time of each system. With the help of our carefully designed optimization
algorithms, LTM reduces the response time to 30% and SBO decreases
the response time to 35%.



6 Conclusion

We have evaluated our proposed LTM and SBO overlay topology matching algorithms
in static as well as dynamic environments. Both schemes are fully distributed and
scalable in that each peer can conduct the algorithm independently without requiring
any global knowledge. The other strength of LTM and SBO is that they are complementary
to cache-based and forwarding-based approaches, so that further improvements
can be made when they are deployed together. LTM shows its advantage in convergence
speed but creates slightly more overhead than SBO. It also demands synchronized
time among peers, which implies that additional overhead is needed to run a clock
synchronization protocol, such as NTP.



Abstract. This paper is motivated by the problem of poor searching efficiency
in decentralized peer-to-peer file-sharing systems. We address the searching
problem by considering and modeling the basic trade-off between forwarding
queries among peers and maintaining lookup tables in peers, so that we can
use an optimized lookup table scale to minimize bandwidth consumption and
greatly improve the searching performance under arbitrary system parameters
and resource constraints (mainly the available bandwidth). Based on the model,
we design a decentralized peer-to-peer searching strategy, namely the Lookup-ring,
which provides very efficient keyword searching in highly dynamic peer-to-peer
environments. The simulation results show that Lookup-ring can easily
support a large-scale system with more than 10^6 participating peers at a very
small cost in each peer.



1. Introduction 

Searching efficiency is a crucial factor for peer-to-peer (P2P) file-sharing systems
(Napster [1], Gnutella [2], KaZaA [3]). Although centralized indexing is efficient (e.g.
Napster [1]), it has inherent defects [6], so research communities and Internet users
have turned to decentralized systems, in which searching is performed cooperatively by
forwarding queries among peers and using peers’ lookup tables (containing replicas
of items’ metadata) to find results (e.g. Gnutella [2], KaZaA [3]). Notable
advances [3, 4, 5, 6, 11, 13] have been made in decentralized searching to
improve performance; however, searching (especially searching by keywords) in
decentralized P2P systems remains challenging.

Unlike existing approaches, which take into account either metadata
replication [5, 6, 11, 13] or enhanced query forwarding [4], in this paper we address
the problem of decentralized searching by simultaneously considering metadata
replication and queries, and by using optimized lookup tables to minimize bandwidth
consumption and greatly improve searching performance. Our concept is as follows:
putting more metadata (e.g. file indices) in peers’ lookup tables allows queries to be
resolved more quickly and reduces the bandwidth cost of query forwarding; however,
more indices imply that system variations (peers joining or departing) will cause
more updates for expired metadata and increase the bandwidth cost of
metadata maintenance. There is therefore a basic trade-off between queries and metadata
maintenance, and we model this trade-off to find the optimized scales of peers’
lookup tables, so as to minimize total bandwidth consumption or maximize searching
performance under given environment parameters. In Section II, we propose the
model to estimate optimized lookup table scales, and find that both bandwidth
consumption and average searching hops can be reduced to O(N^0.5) (N is the number
of peers), in comparison with the O(N) complexity of the conventional random walk
strategy [5]. Based on the model, we propose a decentralized P2P file-sharing system,
the Lookup-ring, which implements a general searching strategy with nearly optimal
performance under arbitrary system parameters (system scale, magnitude of shared
files, frequency of user queries, etc.) and resource constraints (mainly the
bandwidth constraint in peers). In the current Internet environment, Lookup-ring can
easily support a system with more than 10^6 peers in which most search queries are
resolved within a few hops.

The rest of the paper is organized as follows. Section II gives the model. Section III
presents the details of the Lookup-ring design. Section IV presents the performance
evaluation. Section V discusses related work and Section VI concludes the paper.



2. Model for Bandwidth and Trade-Off 

In this section, we propose an analytic model to estimate bandwidth consumption and
describe the trade-off between querying and metadata maintenance. We first define the
notation used in the model (see Table 1). We consider a system consisting of N peers (N is
around 10^6) and sharing U unique files (we do not count file replicas in U), denoted by
f_1, f_2, ..., f_U. Each unique file may have some replicas shared by users who download
the file. We use r_i to denote the number of f_i’s replicas, and TR to denote the total
replica number (TR = r_1 + r_2 + ... + r_U). For system variations, the peers’ average session
time is denoted by T_session. Based on measurement works [7, 8], we list reference
values of these system parameters in Table 1 (these values are used only
for reference; the model does not depend on them).

Suppose that there are in total k_i indices of file f_i in all peers’ lookup tables (for
i = 1, ..., U); we call k_i the “indexing factor” of f_i. The search process is a sequence of
probes: when a peer is probed, it attempts to match the query against its local file indices.
We assume the searching is perfect and strict, i.e. a query for file f_i can always and
only be resolved by a probe to a peer containing an index to f_i. For a random search
process, the search size (number of probed peers) for resolving a query for f_i is a
random variable with expectation N/k_i [13, 4].
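
The N/k_i expectation can be checked with a small Monte Carlo sketch. It assumes each probe picks a peer uniformly at random with replacement, so the number of probes is geometric with success probability k_i/N. The function name and the toy values N = 1000 and k = 50 are chosen here purely for illustration.

```python
import random


def average_search_size(n_peers, k_indices, trials, seed=0):
    """Estimate the mean number of probes needed to hit an index holder."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        probes = 0
        while True:
            probes += 1
            # Peers 0 .. k_indices-1 are taken to hold the index; which
            # peers hold it does not matter under uniform probing.
            if rng.randrange(n_peers) < k_indices:
                break
        total += probes
    return total / trials


estimate = average_search_size(n_peers=1000, k_indices=50, trials=20000)
expected = 1000 / 50  # N / k_i
```

The estimate should land close to N/k_i = 20 probes, and doubling k_i roughly halves the search size.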

Now we present the model. First we estimate the bandwidth cost of querying. Unique
files have their respective popularities, modeled by the query distribution. Let q = <q_1,
q_2, ..., q_U> be a vector of probabilities that sum to 1, where q_i is the probability that a
query is for file f_i. Therefore, q is the query distribution [13, 4]. Assuming that a
total of Q queries are submitted per second, the total bandwidth for querying is:

$$
BW_{query} = \sum_{i=1}^{U} Q \cdot q_i \cdot m_q \cdot \frac{N}{k_i} \tag{1}
$$

where m_q is the average size of a query message (bits), and N/k_i is the
expected number of hops to resolve a query for f_i.

Table 1. Notation and Model Parameters. 



Parameter            Meaning                                                                     Referenced value
-------------------  --------------------------------------------------------------------------  ----------------
N                    Number of peers in the system                                               10^6
U                    Number of unique files                                                      10·N = 10^7
f_1, ..., f_U        Unique files shared in the system
r_i                  Number of f_i’s replicas                                                    Zipf
TR                   Total replica number, TR = Σ r_i                                            200·N = 2·10^8
Q                    Number of queries per second                                                1/60·N
T_session            Average peers’ session time (on-line time)                                  1 hour
λ_peer               Poisson parameter for peer variations                                       1/3600
V_peer               Number of peer variations per second, for both joins and departures         λ_peer·N
V_file               Number of file variations per second, for both adding and removing files    1.74×10^-3·N
k_i                  Number of indices of f_i
q = <q_1, ..., q_U>  Query rate distribution, Σ q_i = 1
m_q, m_p             Average message size for querying and for updating an expired index         0.5 KByte
R_msg                Redundant messaging: a peer receives one updating message R_msg times


Second, we estimate the maintenance costs. When a replica variation occurs, we
need to update the affected indices in lookup tables. The bandwidth cost is made
up of the following parts: BW_peer_depart, for updating expired indices pointing to a
leaving peer; BW_peer_join, for a joining peer downloading its lookup table from others;
and BW_file, for updating lookup tables due to file variations (both sharing new files and
removing shared files). We use V_peer and V_file to denote the variation frequency (times
per second) for peers and files respectively. Peer variations are usually modeled with a
Poisson distribution with parameter λ_peer = 1/T_session [14, 19], and we have V_peer = λ_peer·N
for both joins and departures. For V_file, we know from [7] that the largest number of
successfully downloaded files per peer per day is no more than 75 (a very large
number), and thus V_file ≈ 2·75/(24×3600)·N for both newly downloaded files and
removed files. (The model describes stationary system behavior, so we assume the
number of new files to be approximately equal to the number of deleted files over a given duration.)
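
As a worked check of these rates, the reference values of Table 1 can be plugged in directly. The figures for N = 10^6 are simple arithmetic on the stated parameters and are shown only to give a sense of scale.

```latex
\lambda_{peer} = \frac{1}{T_{session}} = \frac{1}{3600\ \mathrm{s}} \approx 2.78 \times 10^{-4}\ \mathrm{s}^{-1},
\qquad
V_{peer} = \lambda_{peer} \cdot N \approx 278\ \mathrm{s}^{-1} \quad (N = 10^{6})

V_{file} \approx \frac{2 \cdot 75}{24 \times 3600} \cdot N = \frac{150}{86400} \cdot N
\approx 1.74 \times 10^{-3} \cdot N \approx 1736\ \mathrm{s}^{-1} \quad (N = 10^{6})
```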

Now we calculate the maintenance costs. A failure of a file replica invalidates all indices
pointing to it. These “expired” indices should be updated sooner or later; otherwise
the total number of valid indices will decrease. We suppose the indexing assignment
has no preference for replicas with higher availability. Thus, for a unique file f_i with r_i
replicas and k_i indices in total, one replica failure causes k_i/r_i expired
indices on average. For BW_peer_depart, the departure of a peer P causes failures of all its
replicas, and for file f_i with r_i replicas the probability that a departing peer P contains
f_i is r_i/N. So the expected number of expired indices caused by a peer departure is:



$$
\text{expired number} = \sum_{i=1}^{U} \frac{r_i}{N} \cdot \frac{k_i}{r_i} = \frac{1}{N} \sum_{i=1}^{U} k_i \tag{2}
$$



So, bandwidth consumption for maintaining them is:

$$
BW_{peer\_depart} = \frac{1}{N} \sum_{i=1}^{U} k_i \cdot V_{peer} \cdot R_{msg} \cdot m_p \tag{3}
$$



where m_p is the average size of an updating message. In (3) we use R_msg to denote the
redundancy factor of messages, which indicates that, in order to update a single
index, a peer will on average receive the corresponding updating
message R_msg times, each copy carrying the complete updating information (this is defined
for non-acknowledged messages; for acknowledged message transmission, R_msg is defined as
double the value of the non-acknowledged case). R_msg is a system parameter that
characterizes the updating algorithm of a specific system. To make updates tolerant of
message loss, some algorithms use redundant messaging, in which peers may receive
the same message more than once. In the model we use R_msg to reflect this behavior.

A peer loses its lookup table after departure, and must download the entire table at
join time. The average number of indices in a lookup table is (1/N)·Σ k_i (summing over
i = 1, ..., U). So:

$$
BW_{peer\_join} = \frac{1}{N} \sum_{i=1}^{U} k_i \cdot V_{peer} \cdot m_p \tag{4}
$$



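For file variations, from the above analysis a variation of f_i generates (k_i/r_i)·R_msg·m_p
of bandwidth cost for updating indices, so the total bandwidth is:

$$
BW_{file} = \sum_{i=1}^{U} V_{file} \cdot \frac{r_i}{TR} \cdot \frac{k_i}{r_i} \cdot R_{msg} \cdot m_p \tag{5}
$$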



where TR is the total replica number (see Table 1), and the number of variations for f_i is
assumed to be proportional to the number of f_i’s replicas r_i (i.e. the number of peers
containing f_i). Summing all these costs gives the estimate of the total bandwidth
consumption:

$$
BW_{total} = BW_{query} + BW_{peer\_depart} + BW_{peer\_join} + BW_{file} \tag{6}
$$



Notice that the {k_i} are independent, so we can minimize each term of the summation
in (6) by choosing the best k_i. The optimized choice of k_i for minimum BW_total is:

$$
k_i = N \cdot \sqrt{\frac{Q \cdot m_q}{\left(V_{peer} + R_{msg} \cdot (V_{peer} + N/TR \cdot V_{file})\right) \cdot m_p}} \cdot \sqrt{q_i} = N \cdot \theta \cdot \sqrt{q_i} \tag{7}
$$

subject to k_i ≤ N (recall that k_i is the indexing factor of f_i), where θ is a system
parameter independent of i. The minimum BW_total is:



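$$
BW_{total} = \frac{2 \cdot Q \cdot m_q}{\theta} \sum_{i=1}^{U} \sqrt{q_i} \le \frac{2 \cdot Q \cdot m_q}{\theta} \cdot \sqrt{U} \tag{8}
$$

where we applied the Cauchy inequality to the summation. So, the average bandwidth cost
per peer based on k_i is:

$$
bw_{per\ peer} = \frac{BW_{total}}{N} \le \frac{2 \cdot Q \cdot m_q}{N \cdot \theta} \cdot \sqrt{U} = 2 \cdot \frac{Q}{N} \cdot \frac{m_q}{\theta} \cdot \sqrt{\frac{U}{N}} \cdot \sqrt{N} \tag{9}
$$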



From (7), the optimized k_i is proportional to the square root of the querying rate q_i of
each file f_i. It is clear that k_i is a trade-off between querying and maintaining the
system, since the numerator Q·m_q·q_i and the denominator
(V_peer + R_msg·(V_peer + N/TR·V_file))·m_p under the square root represent the cost of
queries and maintenance respectively. In (9) we have the average bandwidth cost per peer.
Q/N is the number of queries each peer submits to the system per second (e.g. 1/60). U/N
is also a stationary environment parameter (e.g. 10-20 based on [8]). So, the bandwidth
cost per peer is O(√N) under optimization. Because of the square root, the scalability of
a system with optimized indexing factors is fairly good. Using practical parameter values in
(9), we find that the random searching strategy becomes surprisingly powerful under
optimized indexing factors (i.e. optimized lookup table scale). For example, for N = 10^6
peers with only 1 hour of session time, using R_msg = 3 (very redundant messaging) and the
other reference values in Table 1, the optimized strategy can support heavy query loads in
which each peer submits a query per minute, within only 15 Kbps of bandwidth per peer (both
upstream and downstream). This is a very low bandwidth cost that modem connections
can easily afford, and it can be reduced further with more efficient messaging (lower R_msg).
For comparison, based on a report from 2001 [18], each Gnutella peer consumes more
than 150 Kbps of bandwidth both upstream and downstream.
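
The optimum can be verified term by term. In the sketch below, D is a shorthand introduced here (it does not appear in the original) for the maintenance factor (V_peer + R_msg·(V_peer + N/TR·V_file))·m_p, and g_i denotes the i-th summand of the total bandwidth after the four cost components are added.

```latex
g_i(k_i) = \frac{Q \, m_q \, q_i \, N}{k_i} + \frac{D}{N} \, k_i ,
\qquad
g_i'(k_i) = -\frac{Q \, m_q \, q_i \, N}{k_i^{2}} + \frac{D}{N} = 0
\;\Longrightarrow\;
k_i = N \sqrt{\frac{Q \, m_q \, q_i}{D}} = N \, \theta \, \sqrt{q_i},
\quad \theta = \sqrt{\frac{Q \, m_q}{D}}

g_i(k_i)\Big|_{k_i = N \theta \sqrt{q_i}}
= 2 \sqrt{Q \, m_q \, q_i \, D}
= \frac{2 \, Q \, m_q}{\theta} \sqrt{q_i},
\qquad
\sum_{i=1}^{U} \sqrt{q_i} \le \sqrt{U \sum_{i=1}^{U} q_i} = \sqrt{U}
```

At the optimum the query term and the maintenance term of each summand are equal, which is where the factor 2 comes from. Since Q, V_peer, V_file and TR all scale linearly with N, θ does not depend on N.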

The model gives a theoretical lower bound on per-peer bandwidth consumption for
constructing a lookup system based on a random searching process. It shows that, with
appropriate lookup table scales and updating mechanisms, a uniform system (i.e. no
supernodes) with simple unbiased search is capable of supporting very large systems. In
the following sections we give a practical design derived from the model.
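
The model can also be exercised numerically. The sketch below uses the reference values of Table 1 (taking 0.5 KByte as 4000 bits, which is an assumption) together with a toy popularity vector of 100 files rather than a Zipf workload. All function names are hypothetical. It computes θ and the optimized k_i, then checks that scaling the indexing factors up or down only increases the total bandwidth.

```python
import math


def theta(big_q, m_q, m_p, n, tr, v_peer, v_file, r_msg):
    """System parameter theta: sqrt of query factor over maintenance factor."""
    maintenance = (v_peer + r_msg * (v_peer + n / tr * v_file)) * m_p
    return math.sqrt(big_q * m_q / maintenance)


def optimal_indexing_factors(q, n, th):
    """k_i = N * theta * sqrt(q_i), capped at N."""
    return [min(n, n * th * math.sqrt(qi)) for qi in q]


def total_bandwidth(k, q, big_q, m_q, m_p, n, tr, v_peer, v_file, r_msg):
    """Query cost plus departure, join and file-variation maintenance costs."""
    per_index = (v_peer + r_msg * (v_peer + n / tr * v_file)) * m_p / n
    query = sum(big_q * qi * m_q * n / ki for ki, qi in zip(k, q))
    return query + per_index * sum(k)


N = 10**6
params = dict(
    big_q=N / 60,                         # one query per peer per minute
    m_q=4000.0, m_p=4000.0,               # 0.5 KByte taken as 4000 bits
    n=N, tr=200 * N,
    v_peer=N / 3600,                      # 1 hour average session time
    v_file=2 * 75 / (24 * 3600) * N,
    r_msg=3,
)
q = [0.05] * 10 + [0.5 / 90] * 90         # toy popularity vector, sums to 1

th = theta(**params)
k_opt = optimal_indexing_factors(q, N, th)
best = total_bandwidth(k_opt, q, **params)
closed_form = 2 * params["big_q"] * params["m_q"] / th * sum(
    math.sqrt(qi) for qi in q)
```

For this toy vector none of the k_i reaches the cap N, so the minimum agrees with the closed-form expression. With a heavily skewed distribution the most popular files would hit k_i = N and the closed form would no longer be exact.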



3. Design of Lookup-Ring

This section presents the design of Lookup-ring. Lookup-ring is derived from the model to
achieve optimized performance; its indexing factors are calculated based on
equation (7). Lookup-ring can be built on top of most structured P2P infrastructures, e.g.
Chord, Pastry [9] and SkipNet [20]. In this paper we illustrate how it works on top of
Pastry and SkipNet as examples. For details of their structures, please refer to [9, 20].

3.1 Indexing Factor and File Levels 

Assuming we know the query rate distribution $\langle q_1, q_2, \ldots, q_U \rangle$ (due to 
limited space, we do not provide an estimation of $q_i$ but only point out that it is reasonable 
to assume $q_i$ to be proportional to the replica number $r_i$), we can obtain the best indexing 
factor $k_i^*$ with (7). We first quantize $k_i^*$ into discrete levels, and files whose $k_i^*$ 
belong to the same level have the same actual (quantized) indexing factor $k_i$. The indexing 
factor is quantized into $m$ levels with radix 2, i.e. we use a set of $m$ indexing factor 
values $M = \{N, 2^{-1}N, 2^{-2}N, \ldots, 2^{-(m-1)}N\}$ for all indices. For file $f_i$ with 
$k_i^*$, the actual indexing factor $k_i$ should be the value $2^{-j}N$ in $M$ closest to 
$k_i^*$, and we call $f_i$ a “$j$-level” file. 
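
The quantization step above can be sketched as follows. This is only an illustration: the function name is hypothetical, and "closest" is assumed to mean smallest absolute difference, which the text does not specify.

```python
def quantize_indexing_factor(k_star: float, n_peers: int, m_levels: int):
    """Return (level j, quantized factor 2**-j * N) for an optimal factor k_star.

    Assumption: "closest" means smallest absolute difference; ties go to the
    lower level (the larger indexing factor).
    """
    candidates = [n_peers / 2**j for j in range(m_levels)]  # the set M
    level = min(range(m_levels), key=lambda j: abs(candidates[j] - k_star))
    return level, candidates[level]


# Example: N = 1024 peers, m = 5 levels, so M = {1024, 512, 256, 128, 64}.
level, k = quantize_indexing_factor(300, 1024, 5)
print(level, k)  # a file with k* = 300 becomes a 2-level file with k = 256.0
```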

3.2 PeerId, FileId, and Peer Groups 

In Lookup-ring, each peer is assigned a unique and uniformly distributed peerId. We 
also generate a uniformly distributed fileId for each unique file using hash functions. We use 
the peerId to partition peers into groups, and the fileId to match unique files with groups. 

Peers are partitioned into hierarchical groups as follows. All peers with the same $j$-bit 
prefix in their peerIds are united into a $j$-level group, for $j = 0, 1, \ldots, m-1$. A 
$j$-level group is denoted by the $j$-bit common prefix of the peers it contains. The prefix of 
a group is also called the group’s groupid. For example, the 010-group is a 3-level group 
with groupid “010”, which consists of all peers with the same 3-bit prefix “010”. Due 
to the uniformity of peerIds, a $j$-level group has approximately $2^{-j}N$ peers. 

Each $j$-level unique file is matched to the one $j$-level group whose groupid equals the 
$j$-bit prefix of the file’s fileId. If a file is matched to a group, all peers belonging to the 
group should contain the file’s index in their lookup tables. Thus, a peer $P$ with peerId 
$id_P$ contains indices of all $j$-level files whose fileIds share $id_P$’s $j$-bit prefix, for 
$j = 0, 1, \ldots, m-1$, and a $j$-level file is indexed by approximately $2^{-j}N$ peers, as 
originally intended. Consequently, any query for a $j$-level file can be resolved by traversing 
all $j$-level groups, i.e. forwarding the query to $2^j$ peers with different $j$-bit peerId 
prefixes. In doing so, we also obtain all unique files with file level less than $j$. Therefore, 
we can resolve any query by traversing peers that cover all the different $(m-1)$-bit prefixes. 
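
The prefix-based matching of files to groups can be illustrated with a small sketch. The 16-bit id width and all function names below are assumptions made for the example, not part of the original design.

```python
ID_BITS = 16  # assumed id width; the paper does not fix one


def prefix(identifier: int, j: int, id_bits: int = ID_BITS) -> str:
    """Return the j-bit prefix of an id as a bit string ('' when j == 0)."""
    return format(identifier, f"0{id_bits}b")[:j]


def indexes_file(peer_id: int, file_id: int, file_level: int) -> bool:
    """A peer indexes a j-level file iff their j-bit prefixes are equal."""
    return prefix(peer_id, file_level) == prefix(file_id, file_level)


def one_peer_per_group(peer_ids, level: int) -> dict:
    """Pick one representative peer for every populated level-bit prefix."""
    chosen = {}
    for pid in peer_ids:
        chosen.setdefault(prefix(pid, level), pid)
    return chosen


# A 0-level file is indexed by every peer; a 3-level file only by its group.
print(indexes_file(0x4000, 0x5000, 0))  # True
print(indexes_file(0x4000, 0x5000, 3))  # True: both start with "010"
print(indexes_file(0x4000, 0x5000, 4))  # False: "0100" vs "0101"
```

Visiting the peers returned by `one_peer_per_group(peers, j)` corresponds to traversing all $j$-level groups.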

Since Lookup-ring is built on top of a structured P2P infrastructure, the peer 
partitioning ought to be consonant with the underlying peer organization, and the peerId 
should be able to partition peers into groups in the DHT organization. If Lookup-ring 
is built on Pastry, we use Pastry’s nodeId as the peerId in Lookup-ring, because in 
Pastry’s organization the nodeIds play the role of partitioning nodes into prefix- 
based groups.^2 The peer groups and file levels are shown in Fig. 1. 



[Figure: groups at levels 0, 1 and 2 on top of Pastry (prefix groups 0-xx, 1-xx; 00-xx, 01-xx, 10-xx, 11-xx) and groups on top of SkipNet. Legend: peers in a group in Pastry; peers in a group in SkipNet.] 



Fig. 1. Partition on top of Pastry and SkipNet. 



^2 Other P2P infrastructures (e.g. Chord) also have a similar nodeId playing the partitioning role. For SkipNet, 
note that it is a “bi-id” system whose NameId indicates proximity between nodes while its NumericId 
partitions nodes into hierarchical rings. So, we use SkipNet’s NumericId as the peerId in Lookup-ring. 

3.3 Searching in Lookup-Ring 

Lookup-ring provides searching by keywords and substrings. Each unique file is 
associated with a “label” for searching, e.g. the filename. Each file index defines a 
match between a unique file’s label and the location of one of the file’s replicas. Thus, a 
file index contains the file label and the IP address of the location. We also store the file’s 
fileId and the location’s peerId in the index. For a 20~30-byte file label, we need only about 
64 bytes per file index. Based on the model, each peer of a 10^-peered system needs to contain 
about $10^4$ indices, and the lookup table size is no more than 1 Mbyte. 

Lookup-ring has a “prefix-traversing” searching strategy. Consider a query $q$ 
submitted at peer $P_0$. We first check $P_0$’s lookup table to see whether $q$ can be 
resolved locally. Here we obtain all 0-level results of $q$. If $q$ is satisfied (i.e. it gets 
enough results) we stop searching; otherwise $q$ is forwarded to a peer $P_1$ whose 
1-bit prefix differs from that of $P_0$. From $P_0$ and $P_1$ together we can find all 1-level 
results of $q$, because $P_0$ and $P_1$ represent all 1-level groups. If $q$ is still 
unresolved, we continue this prefix-traversing. In general, before the $j$-th step $q$ has been 
forwarded to $2^{j-1}$ peers $\{P_0, P_1, \ldots, P_{2^{j-1}-1}\}$ whose peerIds cover all 
$(j-1)$-bit prefixes. In the $j$-th step we forward $q$ to another $2^{j-1}$ peers, namely 
$P_{2^{j-1}}, \ldots, P_{2^j-1}$, so that the $j$-bit peerId prefixes of all searched peers 
$\{P_0, P_1, \ldots, P_{2^j-1}\}$ cover all the $j$-bit prefixes. After that, we 
have traversed all groups of level no more than $j$ and found all results whose levels 
are no more than $j$. The searching process stops either when the query is resolved or when we 
reach the last level (the $(m-1)$-level), at which point all unique files’ indices have been searched. 
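
A minimal simulation of prefix-traversing is sketched below. It assumes every peer's lookup table is available in one dictionary and that candidate peers are scanned in id order; in the real system the forwarding would be done by the underlying DHT. All names are hypothetical.

```python
def prefix_traversing_search(start_id, tables, match, m_levels, wanted=1, id_bits=8):
    """Simulate prefix-traversing from peer start_id.

    tables: dict peer_id -> lookup table (dict label -> location)
    match:  predicate deciding whether a label answers the query
    Returns (results, visited peers in visiting order).
    """
    def pfx(pid, j):
        return format(pid, f"0{id_bits}b")[:j]

    def collect(pid):
        return {lab: loc for lab, loc in tables[pid].items() if match(lab)}

    visited = [start_id]
    results = collect(start_id)              # step 0: all 0-level results
    for j in range(1, m_levels):
        if len(results) >= wanted:           # query satisfied, stop early
            break
        covered = {pfx(pid, j) for pid in visited}
        for pid in sorted(tables):           # visit one peer per uncovered j-bit prefix
            if pfx(pid, j) not in covered:
                covered.add(pfx(pid, j))
                visited.append(pid)
                results.update(collect(pid))
    return results, visited


# Four peers with 2-bit ids; "beta" is a 1-level file stored in group "1".
tables = {0: {"alpha": "A"}, 1: {"alpha": "A", "gamma": "G"},
          2: {"alpha": "A", "beta": "B"}, 3: {"alpha": "A", "beta": "B"}}
print(prefix_traversing_search(0, tables, lambda l: l == "beta", 3, id_bits=2))
```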

Because Lookup-ring’s peerId is consistent with the underlying DHT, it is very 
easy to perform prefix-traversing searching: such prefix-traversing is a natural property of 
most DHTs [16]. Therefore, searching in Lookup-ring is 
very efficient, without redundant query forwarding. 

3.4 “Principle of Logical Locality” for Location Choices 

Because a unique file usually has more than one replica, we must decide 
which location each of the file’s indices records. For each index of the file we choose only one 
replica as the location. We propose a “principle of logical locality” for choosing 
the locations of indices so that maintenance is easy. 

For a $j$-level unique file $f_i$, all of $f_i$’s indices are stored in a matched $j$-level 
group $g$. If one of $f_i$’s replicas $P$ fails, we need to efficiently update all affected 
indices in $g$, i.e. the indices that pick $P$ as $f_i$’s location. To make maintenance easy 
and save bandwidth, the peers in $g$ whose indices of $f_i$ pick the same location should be 
situated in one logical locality in $g$ (i.e. a continuous region in id-space), so that we can 
perform a locality-based update in which messages are spread precisely to the peers in the 
affected region that actually “need” the update, while other peers do not receive the message. 
For this purpose, we use the “principle of logical locality” to choose the location for each 
index: when choosing the location for an index of a unique file, a peer should always pick the 
logically “closest” replica of that file. In other words, the location in a peer’s index should 
always be the living replica that is currently the closest one to the peer in logical distance. 
The goal of maintenance is to preserve this invariant after system variations. 
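
The principle can be sketched as below. The sketch assumes logical distance is the absolute difference of integer ids (wrap-around in the id space is ignored) and breaks ties towards the smaller replica id; both choices are assumptions made for illustration.

```python
def closest_replica(peer_id: int, replica_ids) -> int:
    """Pick the replica logically closest to peer_id (assumed |a - b| distance)."""
    return min(replica_ids, key=lambda r: (abs(r - peer_id), r))


def sections(group_peer_ids, replica_ids) -> dict:
    """Map each replica to the peers of the group that choose it as location."""
    out = {}
    for pid in sorted(group_peer_ids):
        out.setdefault(closest_replica(pid, replica_ids), []).append(pid)
    return out


# A group of 16 peers and three replicas a=2, b=8, c=13: each section is a
# contiguous id range, so an update only has to be spread inside one range.
print(sections(range(16), [2, 8, 13]))
# After b (id 8) fails, its former section is split between a and c.
print(sections(range(16), [2, 13]))
```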

Similar to the peerId, here the “locality” should also be consonant with the underlying 
peer organization to facilitate the updating algorithm. In Pastry, both locality and peer 
partitioning are based on Pastry’s nodeId. Thus, we ask each Lookup-ring peer to choose 
the replica whose peerId is currently the closest one. This design fits all 
fundamental considerations of our design, e.g. locality-based updating, since peers 
choosing the same location of a file are logically adjacent in the DHT’s id-space. 
However, as Fig. 1 and Fig. 2 suggest, the replica locations chosen in a given group 
are not uniformly distributed, since peerIds in a group have a common prefix and do 
not fill the peerId-space over which the replicas’ peerIds are scattered. So, we can extend 
this approach to get better uniformity in choosing locations (this extension is not 
necessary for Lookup-ring). We first map the peerIds of replicas into the group with a 
linear transformation before applying the principle of locality. For a $j$-level file $f$ 
matched to a $j$-level group $g$ with $j$-bit groupid $id_g$, we map all peerIds of $f$’s 
replicas into $g$ by right-shifting them by $j$ bits and adding the $j$-bit prefix $id_g$. 
After this linear transformation, the replicas’ mapped peerIds are uniformly distributed in $g$ 
while keeping their original order. Then, each peer in $g$ picks the replica of $f$ whose 
mapped peerId is the closest one. Fig. 2 shows this mapping; in Fig. 2, I, II, and III are the 
three sections of the group (i.e. localities) that consist of peers choosing replica a, b, and 
c as the location of the index, respectively. When a variation occurs (e.g. b suddenly fails), 
peers in section II should be updated with new replica locations. Based on the principle, 
section II is then divided into I' and III', which should update their locations with a and c 
and be merged into I and III, respectively. The boundaries of I' and III' can be determined 
from the peerIds of a, b and c alone (the boundary between I' and III' is equidistant 
from peerId(a) and peerId(c)). So, after b’s failure we have the following update 
strategy: b’s neighbor replicas a and c detect b’s failure (how they detect the failure is 
explained in Section 3.5) and send their locations and the boundaries of I' and III' to one 
peer in I' and one peer in III', correspondingly (the dashed lines in Fig. 2); these peers 
then spread the received messages in I' and III' to update all other peers in the respective 
locality. When b joins there is a similar process: b calculates section II’s boundaries 
from a’s and c’s localities and spreads the updating message in II.^3 
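
The linear transformation described above (right-shift by $j$ bits, then prepend the $j$-bit groupid) can be written in a few lines. The id width and the function name are assumptions for the example.

```python
def map_into_group(replica_id: int, group_id: int, j: int, id_bits: int = 16) -> int:
    """Map a replica's peerId into the j-level group with prefix group_id.

    The id is right-shifted by j bits and the j-bit group prefix is placed in
    the freed high-order bits, so the order of the mapped ids is preserved.
    """
    return (group_id << (id_bits - j)) | (replica_id >> j)


# 8-bit ids, 2-level group "01": every mapped id falls into [64, 128).
for rid in (10, 100, 200):
    print(rid, "->", map_into_group(rid, 0b01, 2, id_bits=8))
```

Applying the closest-replica rule to the mapped ids instead of the raw peerIds then yields sections of roughly equal size inside the group.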



[Figure panels: NameID space in Pastry; NameID-ring in SkipNet] 





Fig. 2. Principle of locality in Lookup-ring, and the updates of indices after variation 



^3 In SkipNet the logical locality is defined by NameId rather than NumericId, because in each SkipNet ring 
the sequence and neighborhood of peers are indicated by NameId. So, on top of SkipNet we use the 
NameIds of replicas to determine the sections and guide location choosing. Fig. 2 shows one group (i.e. a 
SkipNet ring) and its sections based on the replicas’ NameIds. 

3.5 Maintaining Lookup-Ring 

For maintenance, we should actively detect variations and update the affected indices. We 
construct file-rings in Lookup-ring, where all replicas of a unique file connect to form a 
ring structure and keep their connections alive with heart-beating messages, so that variations 
can be detected quickly. After that, the detector immediately generates an appropriate update. 

3.5.1 File-Rings 

A file-ring is shown in Fig. 3. Consider a certain unique file $f$ with $r$ replicas stored in 
$r$ peers. These peers are connected into a ring structure (file-ring), ordered by 
logical locality in the DHT, namely $P_1, P_2, \ldots, P_r$ (logical locality is defined in the 
previous section). Each peer participating in a file-ring should hold links to its two neighbors 
(predecessor and successor in the ring) and send “heart-beating” messages to them every 
$T_{probe}$ to maintain connectivity. A peer may participate in many file-rings, 
according to the unique files it shares. 




[Figure legend: peer in file-ring; adjacent-set; failure in file-ring; re-connection; notify and update affected adjacent-set] 



Fig. 3. Maintaining file-ring connectivity 



3.5.2 Active Variation Detection and File-Ring Recovery 

Peers periodically probe their file-ring neighbors. To reconnect a broken file-ring, a peer 
should be aware not only of its direct neighbors but also of some nearby peers, namely its 
“adjacent-set”, similar to Pastry’s leaf set [9]. When a neighbor fails, a peer can 
reconnect the file-ring using its adjacent-set. Then, it keeps heart-beating with its new 
neighbor and also notifies nearby peers to update their adjacent-sets. Fig. 3 shows a 
file-ring with 8 peers and adjacent-sets. When replica c fails (either through peer failure or 
through dropping the replica), b and d will detect the failure after ($T_{probe} + T_{out}$) and 
begin to repair the file-ring. Peers b and d first find each other from their adjacent-sets and 
reconnect the file-ring (b and d exchange adjacent-sets for verification and updating). Then, b 
sends its new adjacent-set to a and h to update their expired adjacent-sets, and d likewise 
updates e’s and f’s adjacent-sets. When replica c joins the file-ring, a similar procedure is 
performed: b and d receive the joining request, break their interconnection and 
instead each keep a connection with c, and notify a, h and e, f to update their adjacent-sets. 

We set $T_{probe}$ to 60 seconds and keep 16 peers in the adjacent-set. For each variation, 
each detector notifies 7 peers on its side, using acknowledged messaging. 
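
The repair procedure can be sketched on a toy ring. The sketch models the ring as an ordered list and only computes who detects the failure and who is notified; heart-beating, timeouts and the exchange of adjacent-sets are omitted, and the function name is hypothetical.

```python
def repair_file_ring(ring, failed, notify=7):
    """Remove a failed replica from a file-ring and report the repair actions.

    ring:   replicas in logical order (the last element links back to the first)
    notify: number of peers each detector notifies on its own side
    Returns (repaired ring, (pred, succ) detectors, peers notified by pred,
             peers notified by succ).
    """
    i = ring.index(failed)
    pred, succ = ring[i - 1], ring[(i + 1) % len(ring)]
    repaired = [p for p in ring if p != failed]   # pred and succ are now adjacent
    n = len(repaired)
    ip, isucc = repaired.index(pred), repaired.index(succ)
    pred_side = [repaired[(ip - k) % n] for k in range(1, notify + 1)]
    succ_side = [repaired[(isucc + k) % n] for k in range(1, notify + 1)]
    return repaired, (pred, succ), pred_side, succ_side


# The 8-peer ring of Fig. 3 with two notified peers per side: c fails, b and d
# reconnect, b notifies a and h, d notifies e and f.
print(repair_file_ring(list("abcdefgh"), "c", notify=2))
```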

3.5.3 Updating Lookup Tables After System Variations 

Replicas in a file-ring are ordered by logical locality; therefore the two detectors of a 
replica failure are exactly the two replicas whose locations should be used to update the 
expired indices. Thus, in Fig. 2 peers a and c will detect b’s failure within 
($T_{probe} + T_{out}$) thanks to file-ring heartbeat messaging. After that, a and c calculate 
their respective updating sections (i.e. I' and III'), and each of them immediately sends an 
update message, via the underlying P2P routing, to one peer in the corresponding section to 
maintain the lookup tables. The update message contains the new location of the replica (a 
or c), the fileId of the unique file, and the boundaries of the section inside which the 
message should be spread. The peer receiving the update message then spreads it to the 
entire updating section in the file’s group, using the message broadcasting algorithm 
of the underlying structured peer organization. 

In detail, most DHTs can perform locality-based message broadcasting as a basic 
service, i.e. broadcasting messages to all peers in a consecutive section, based on its 
logical locality, in a partitioned group [16]. Lookup-ring uses DHT-based 
broadcasting to spread its update messages, following the algorithm proposed in 
[16]. On top of Pastry (and Chord, etc.), we can derive from the routing tables a 
spanning tree for an arbitrary nodeId section (rooted at any peer in the section). Via 
broadcasting, the first peer can spread the update information to all other $M$ peers in 
the section through exactly $M-1$ messages [16]. This broadcast has no message 
redundancy: each peer receives the needed message exactly once. So, the $R_{msg}$ in 
the model (see Section 2) should be 1. To further guarantee the update, we use 
confirmed messaging, in which all update messages must be acknowledged. If an 
acknowledgement is not received within a timeout period, the message is 
retransmitted. Therefore, considering both acknowledgement and redundancy during 
broadcast, the message redundancy factor $R_{msg}$ in our model should be 2 for Pastry.^4 



4. Performance Evaluation 

We evaluate Lookup-ring with simulations. We run our simulator on 
Linux on a Pentium IV CPU with 2 GB of memory, which can support more than $10^4$ 
simulated peers. We construct and evaluate Lookup-ring on top of SkipNet. We 
implement SkipNet based on [20], using the basic type of SkipNet with only the R-table and 
a density parameter k equal to 2. For shared files, the number of unique files is 10 times 
the peer number and the total number of files is 200 times the peer number, derived from 
[8, 7]. The simulation has two aspects. First, we examine the feasibility and efficiency of 
Lookup-ring by simulating environments with different peer numbers and peer 
availabilities, in order to see the bandwidth cost at each peer of supporting a heavy query 
load (one query per peer per minute). Second, we compare Lookup-ring with 
random-walk searching [5] in order to see how much improvement we have gained. 
We are mostly concerned with the following two metrics: the average search size (in 
number of hops), and the maximum query workload under a fixed bandwidth. The former 
indicates how quickly a query is resolved, and the latter shows system scalability. 



"* On top of SkipNet there’s a slight difference that the derived spanning tree has some redundancy, where 
each peer will averagely receive an updating message for 1.5 times, and should be 3 for confirmed 
messaging. Here we omit the discussions and readers can refer to [12] for details. 

Fig. 4.a shows the messages and bandwidth consumption of each peer in 
supporting one query per peer per minute, under different peer availabilities ($T_{session}$) 
and system scales (number of total peers). If both kinds of message are 1 Kbit on 
average, the values on the y-axis of Fig. 4.a are also the bandwidth needed by each peer. 
We can see the trend of bandwidth consumption as the system scale is enlarged, which is 
roughly proportional to the square root of the peer number; e.g. when the peer 
number is expanded 10 times from $10^3$ to $10^4$, the bandwidth increases from 0.36 to 1.42 Kbps 
(i.e. 3.9 times, nearly $10^{1/2}$) for 1.5 hours of online time ($T_{session}$ = 5400 s). From this 
trend, we can deduce that for $10^6$ peers the bandwidth is nearly 10 times that of the case with 
$10^4$ peers, i.e. nearly 16 Kbps for $T_{session}$ = 3600 s and 12 Kbps for $T_{session}$ = 7200 s, in both 
upstream and downstream. This result shows very good scalability for large systems. 
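
As a rough check of the square-root trend, the ratios involved can be written out explicitly. This is only an illustrative calculation from the figures quoted in the text, not an additional measurement.

```latex
\frac{1.42}{0.36} \approx 3.9, \qquad
\sqrt{10} \approx 3.16, \qquad
\sqrt{\frac{10^{6}}{10^{4}}} = \sqrt{100} = 10 .
```

The measured tenfold step is thus somewhat steeper than an exact square-root law, while the extrapolation from $10^4$ to $10^6$ peers applies the factor $\sqrt{100} = 10$.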

Fig. 4.b shows the average search size of Lookup-ring and of random walk. The results 
demonstrate the improvement in search size achieved by our strategy. With $10^4$ peers, our 
search size is only 1/40 of that of random walk. This advantage becomes more pronounced as the 
peer number N grows, since our search size grows more slowly, while that of random walk is nearly O(N). 

Fig. 4.c compares the maximum supported query workload under different 
bandwidths. We compare Lookup-ring with random walk in systems of 5000 and 10000 
peers ($T_{session}$ = 3600 s), with 1 Kbit query messages. The results again show that 
Lookup-ring greatly outperforms the Gnutella-like system, especially as the system scale 
grows. The reason is that, by using adaptive indices, we significantly reduce query 
hops while keeping the maintenance cost low. 




Fig. 4. Performance evaluation: a) upper left, b) upper right, c) bottom. 

5. Related Works 

To improve searching efficiency, researchers have tried to exploit all aspects of typical 
query-based decentralized searching. Regarding the strategy for forwarding queries, [5] 
proposes replacing flooding-based query forwarding with random walks, so that network traffic 
is reduced. [4] further exploits data correlations and user interests to guide forwarding 
directions and improve searching performance. Unlike [4], Lookup-ring does not 
need specific data correlations, and thus is suitable for more applications. Regarding 
local lookup tables, result caching [11] and supernodes [3] are employed. 
[13] suggests replicating files in accordance with their query rates, so that the 
expected search size is optimized. In comparison, Lookup-ring has fully 
controlled and optimized caching (the indices), and does not need supernodes. Recently, 
researchers have proposed biasing the overlay topology towards peers with larger 
lookup tables, and Gia in [6] is an integrative design combining many of the above features. 
Among DHT-based approaches, most DHTs support only exact search with an exact 
resource ID, while the others have very limited capability for keyword search [6, 17]. 
Lookup-ring uses a DHT as the underlying organization for system maintenance, and its 
efficient keyword search is built at a higher level. 



6. Conclusions 

Our contributions are as follows. First, we propose an analytic model that 
describes the trade-off between query and maintenance costs, from which the optimized 
lookup table scales can be estimated. Second, we design an efficient decentralized P2P 
searching strategy, in which there are no supernodes and all peers are utilized uniformly. 
Third, we demonstrate the maximum query load and system scale that an unbiased 
decentralized P2P system can support. We show that an unbiased decentralized P2P 
system can sustain a heavy query load in a large-scale system, with low per-peer costs. 

Abstract. In layered peer-to-peer streaming, the bandwidth and 
data availability (number of layers received) of each peer node are 
constrained and heterogeneous, so the media server can still become overloaded, 
and the streaming quality of the receiving nodes may also be 
constrained by the limited number of supplying peers. This paper identifies 
these issues and presents two layer allocation algorithms, each aimed at 
one of the two scenarios of available bandwidth between the media server 
and the receiving node, to reach one or both of two goals: 1) minimize 
the resource consumption of the media server, 2) maximize the streaming 
quality of the receiving nodes. Simulation results demonstrate the 
efficiency of the proposed algorithms. 



1 Introduction 

Peer-to-peer streaming is cost-effective because it can capitalize on the resources of peer 
nodes to provide service to other receivers. In general, peers have three properties: 
1) a peer’s outbound bandwidth, or the bandwidth it is willing to 
contribute, is limited, so multiple supplying peers usually need to cooperate to 
serve a requesting node; 2) different peer nodes can receive and process different 
levels of streaming quality, which means their data availabilities as supplying 
peers are also heterogeneous and constrained; 3) peers’ inbound and outbound 
bandwidths are also heterogeneous. Layered peer-to-peer streaming has the 
potential to address these heterogeneities, but some limiting factors remain: 
the available outbound bandwidths of peers are less than their 
inbound bandwidths, the number of supplying peers available to a receiver is 
sometimes limited, the data availabilities (number of layers received) are constrained, and 
so on. These factors can lead to two results: 1) the media streaming server can still 
become overloaded when the system is large; 2) the receiver’s streaming 
quality may also be constrained by the limited, heterogeneous supplying nodes. 

We identify these issues and present a layer allocation framework that 
contains two layer allocation algorithms. Concerning the available bandwidth 
between the media server and a receiver, there are two scenarios: 1) sufficient 
available bandwidth, which means all the expected layers of the receiver can 
be provided by the media server alone; for this scenario a layer allocation algorithm is 
presented to minimize the resource consumption of the media server while 
satisfying the streaming quality of the receiver; 2) insufficient available bandwidth, 
which means the receiver’s expected layers cannot be provided by the 
media server alone; for this scenario another layer allocation algorithm is presented to 
maximize the streaming quality of the receiver. Both algorithms are executed on the 
receiver side, and we focus only on the situation in which a receiver has one or more 
candidate supplying peer nodes besides the media server; otherwise the receiver’s 
streaming quality is determined solely by the available bandwidth between 
the server and the receiver, a case we do not discuss, for simplicity. In 
addition, we suppose that each peer node can cache all the layers that it has received, 
that each receiving node can identify the available bandwidths between itself and 
the candidate supplier nodes, and that these available bandwidths remain unchanged 
unless supplier nodes fail. 

There has been related work on layered peer-to-peer streaming. For example, 
[1] proposed a framework for quality-adaptive playback of layered media 
streaming; its data allocation granularity is based on data packets, whereas ours is 
based on layers. To our knowledge, the most closely related work is [2], and the 
main differences between it and this paper are: 1) this paper presents an 
improved layer allocation algorithm compared to [2] for the scenario of sufficient 
available bandwidth, under the assumption that the layer rates are heterogeneous; 2) in 
addition, this paper discusses and formalizes the problem in the scenario of insufficient 
available bandwidth, which is not mentioned in [2], and presents a heuristic layer allocation algorithm. 

2 System Model and Layer Allocation Algorithms 

Let $p$ denote a receiving node, $p_0$ denote the media server, and $l_1, l_2, \ldots$ denote 
the layers of the stream that $p$ will request, with $l_1$ as the base layer and the others 
as enhancement layers. These layers are cumulative, i.e. $l_k$ can only be 
decoded if layers $l_1$ through $l_{k-1}$ are available. Let $r_i$ denote the streaming rate of 
$l_i$; the layer rates are heterogeneous. Let $L = \{l_1, l_2, \ldots, l_m\}$, with $m$ no larger 
than the number of layers in the stream, denote the 
expected layer set that $p$ expects to receive according to its inbound bandwidth 
and the layer rates. Let $P = \{p_1, p_2, \ldots, p_n\}$ denote the candidate supplying 
peer node set of $p$, $b(p_j, p)$ ($0 \le j \le n$) denote the available bandwidth between 
$p_j$ and $p$, $a_j$ denote the number of layers that $p_j$ has cached, and $x_{i,j}$ denote whether 
$l_i$ will be provided by $p_j$: if so, it is assigned 1, otherwise 0. Let 
$L(p_j)$ denote the subset of $L$ whose layers are provided by $p_j$. 

2.1 The Scenario of Sufficient Available Bandwidth 

In this scenario, the goal of the layer allocation problem is to maximally save the 
resource consumption of the server. We can formalize this problem as follows: 

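$$\text{Maximize} \quad \sum_{i=1}^{m} \sum_{j=1}^{n} x_{i,j}\, r_i \qquad (1)$$

$$\text{Subject to} \quad \sum_{j=0}^{n} x_{i,j} = 1, \quad i = 1, 2, \ldots, m \qquad (2)$$

$$\sum_{i=1}^{m} x_{i,j}\, r_i \le b(p_j, p), \quad j = 0, 1, \ldots, n \qquad (3)$$

$$x_{i,j} \cdot i \le a_j, \quad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, n \qquad (4)$$
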

Formulation (1) indicates that the goal is to allocate as much of the expected 
layers as possible to the peer set $P$. Equation (2) indicates that each layer is allocated 
to exactly one supplier. Inequality (3) indicates that the total rate of the layers 
allocated to a peer node cannot exceed the available bandwidth between that 
peer node and $p$. Inequality (4) indicates that the layers a peer provides 
cannot exceed what it has cached. 

From formulations (1)-(4) we can see that this optimal allocation problem 
is an integer programming (IP) problem; it is NP-Hard because of these 
constraints, in particular because each variable $x_{i,j}$ must be 0 or 1. Here we present an 
approximation algorithm for this problem, which consists of three steps: 1) relax the 
integrality of each $x_{i,j}$, so that the IP problem becomes a linear 
programming (LP) problem; 2) solve the LP problem with a maximum flow algorithm on a 
constructed directed graph; 3) adjust the flow values on the graph until each $x_{i,j}$ 
becomes 0 or 1. Next we describe this approximation algorithm in detail. 

We first relax the integrality of each variable $x_{i,j}$ so that the optimization 
problem becomes an LP problem. As the matrix 
corresponding to the LP problem’s constraints is large, we do not try to solve it using 
the traditional simplex or dual-simplex algorithms, but convert it into a maximum 
flow problem on a directed graph $G(V, E)$. $G(V, E)$ is constructed as follows: the 
vertex set is $V = L \cup P \cup \{s, t\}$, where $s$ is the source, $t$ is the sink, and $L$ and 
$P$ keep the meanings given above; for each $l_i$, direct an arc from $s$ to $l_i$ with capacity 
$r_i$; for each $l_i$ and each $p_j$, if $p_j$ has cached $l_i$ and $b(p_j, p)$ is not less than 
$r_i$, then direct an arc $e(l_i, p_j)$ with capacity $r_i$ from $l_i$ to $p_j$; for each $p_j$, 
direct an arc from $p_j$ to $t$ with capacity $b(p_j, p)$. From the construction of $G(V, E)$, 
we can conclude that the maximum total rate of the layers allocated to the 
peers in $P$ cannot exceed the maximum flow value through $G(V, E)$. 
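
The construction of $G(V, E)$ and the relaxed (fractional) solution can be sketched as follows. Note the substitution: the sketch uses the simpler Edmonds-Karp augmenting-path method rather than the MPM algorithm the paper employs, and it assumes that "$p_j$ has cached $l_i$" means $i \le a_j$. All function and variable names are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(rates, bandwidth, cached):
    """rates[i-1] = r_i; bandwidth[j] = b(p_j, p); cached[j] = a_j, for j = 1..n."""
    cap = defaultdict(dict)

    def arc(u, v, c):
        cap[u][v] = c
        cap[v].setdefault(u, 0)              # reverse arc for the residual graph

    for i, r in enumerate(rates, start=1):
        arc("s", ("l", i), r)                # s -> l_i with capacity r_i
        for j, b in bandwidth.items():
            if i <= cached[j] and b >= r:    # p_j caches l_i and can carry it
                arc(("l", i), ("p", j), r)
    for j, b in bandwidth.items():
        arc(("p", j), "t", b)                # p_j -> t with capacity b(p_j, p)
    return cap


def max_flow(cap, s="s", t="t"):
    """Edmonds-Karp: repeatedly augment along a shortest residual path."""
    flow = defaultdict(lambda: defaultdict(float))
    total = 0.0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if v not in parent and c - flow[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return total, flow
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] - flow[u][v] for u, v in path)
        for u, v in path:
            flow[u][v] += push
            flow[v][u] -= push
        total += push


cap = build_graph(rates=[3, 3], bandwidth={1: 4, 2: 2}, cached={1: 2, 2: 2})
value, flow = max_flow(cap)
print(value)  # total rate that the peers can take over from the server
```

In this example peer 1 can carry one layer fully and only part of the other, so one arc ends up fractional and must be resolved by the rounding step.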

Definition 1. For any arc $e(l_i, p_j)$ in $\{e(l_i, p_j) \mid 1 \le i \le m, 1 \le j \le n\}$ 
of $G(V, E)$, if the flow value on $e(l_i, p_j)$ equals $r_i$ or 0, then we call this 
arc an integral arc; otherwise we call it a fractional arc. 

After constructing the directed graph, the next step of the algorithm is to 
calculate the maximum flow on $G(V, E)$. Here we use the classical 
MPM algorithm [3] to calculate the maximum flow, and let $w(l_i, p_j)$ denote the flow value 
on $e(l_i, p_j)$. After this calculation each variable $x_{i,j}$ is assigned $w(l_i, p_j)/r_i$. 
If $x_{i,j}$ equals 1, then add $l_i$ to the set $L(p_j)$. If every arc in 
$\{e(l_i, p_j) \mid 1 \le i \le m, 1 \le j \le n\}$ is integral at this point, the algorithm ends, 
since it has obtained an optimal allocation; otherwise further adjust the flow on $G(V, E)$ as follows. 

Construct an edge-induced subgraph $G(V', E')$ of $G(V, E)$: $G(V', E')$ is 
induced by the fractional arcs in the arc set $\{e(l_i, p_j) \mid 1 \le i \le m, 1 \le j \le n\}$ 
of $G(V, E)$, and all parameters on these arcs are kept unchanged in the construction. 

Definition 2. For any vertex belonging to $P$ in $G(V', E')$, if the vertex’s 
in-degree equals 1, then we call this vertex a singleton vertex. 

Repeat the following substeps until all the arcs of $G(V', E')$ are removed: if 
there does not exist any singleton vertex, then adjust the flow according to 
rounding rule 1; otherwise adjust the flow according to rounding rule 2. 




(a) $\delta \le \varepsilon$ (b) $\delta > \varepsilon$ 

Fig. 1. The flow adjustment on $G(V', E')$ 



Rounding rule 1: Ignore the edge directions of $G(V', E')$ and find a longest 
path in $G(V', E')$. As at this time there does not exist any singleton vertex, 
the two end vertices of the longest path must belong to $L$; otherwise we reach a 
contradiction. Suppose the two end vertices of the longest path are $l_i$ and $l_j$, and 
denote the longest path as $l_i \sim l_j$. Let $\delta$ denote the minimum flow value of all 
the arcs on $l_i \sim l_j$, and $e(l_x, p_y)$ denote the arc on $l_i \sim l_j$ whose flow value 
equals $\delta$. Let $\varepsilon$ denote the minimum remaining capacity of all the arcs 
on $l_i \sim l_j$, and $e(l_k, p_g)$ denote the arc on $l_i \sim l_j$ whose remaining capacity 
equals $\varepsilon$. Execute one of the following branches on $l_i \sim l_j$: (i) if 
$\delta \le \varepsilon$, adjust the flow as illustrated in Fig. 1(a); the flow value 
on the arc $e(l_x, p_y)$ then becomes zero, which means at least one arc ($e(l_x, p_y)$) 
becomes an integral arc after this type of adjustment; (ii) if $\delta > \varepsilon$, adjust the 
flow as illustrated in Fig. 1(b); the flow value on the arc $e(l_k, p_g)$ 
then equals its capacity, which means at least one arc ($e(l_k, p_g)$) becomes an 
integral arc after this type of adjustment. Finally, remove some arcs of $G(V', E')$ as 
follows: for each arc (say $e(l_u, p_v)$) of the path $l_i \sim l_j$, if its flow value equals 
zero, then remove it from $G(V', E')$; if its flow value equals its capacity, 
then add $l_u$ to $L(p_v)$ and remove all the arcs associated with $l_u$. After 
the above adjustments, the flow value through each $p_j$ ($1 \le j \le n$) remains 
unchanged, and the flow value on each arc does not exceed its capacity. 

Rounding rule 2: Select an existing singleton vertex in $G(V', E')$; suppose it 
is $p_j$. $p_j$’s single neighbor vertex must belong to $L$; suppose it is $l_i$. Remove all 
arcs associated with $l_i$, add $l_i$ to $L(p_j)$, construct the subset of $L(p_j)$ whose 
total layer rate does not exceed the available bandwidth $b(p_j, p)$ 
but is nearest to $b(p_j, p)$, and substitute this subset for $L(p_j)$. 
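
The subset selection in rounding rule 2 is a small subset-sum problem. The text does not say how the subset is constructed; the sketch below simply enumerates all subsets, which is an assumption that is only practical for the handful of layers assigned to one peer. The function name is hypothetical.

```python
from itertools import combinations


def best_fitting_subset(layer_rates, bandwidth):
    """Choose layers whose total rate is as close to bandwidth as possible
    without exceeding it.

    layer_rates: dict layer index -> rate r_i for the layers in L(p_j)
    Returns (sorted layer indices, total rate); brute force over all subsets.
    """
    best, best_sum = (), 0
    items = sorted(layer_rates.items())
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            total = sum(rate for _, rate in combo)
            if best_sum < total <= bandwidth:
                best, best_sum = combo, total
    return [index for index, _ in best], best_sum


# Layers 1, 2, 3 with rates 2, 3, 4 and b(p_j, p) = 6: keep layers 1 and 3.
print(best_fitting_subset({1: 2, 2: 3, 3: 4}, 6))
```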

The maximum flow on $G(V, E)$ can be computed in $O(|V|^3)$ time. At least 
one arc is removed by either rounding rule, and there exist at most 
$m \cdot n$ arcs in $G(V', E')$, so this layer allocation algorithm is a polynomial-time 
algorithm. 

2.2 The Scenario of Insufficient Available Bandwidth 

In this scenario, as the receiving node’s expected layers cannot be supplied 
by the media server alone, the goal of the layer allocation algorithm is to maximize 
the streaming quality of the receiving node. Considering that a layer $l_i$ can only 
be decoded if layers $l_1$ through $l_{i-1}$ are available, we first define a special layer 
subset of $L$ before formalizing the allocation goal of this scenario. 

Definition 3. Suppose S is a subset of L. We call S a prefixed-subset of L if
and only if the following condition is satisfied: if layer $l_i$ ($1 \le i \le m$) belongs to S,
then all the layer(s) $\{l_j \mid 1 \le j < i\}$ must also belong to S.
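
Definition 3 can be checked mechanically. The following is a minimal sketch, assuming layers are identified by their 1-based indices; the function name `is_prefixed_subset` is hypothetical and not part of the paper.

```python
from typing import Iterable


def is_prefixed_subset(layers: Iterable[int]) -> bool:
    """Return True if the layer indices form {1, 2, ..., k} for some k >= 0.

    Assumption: layer l_i is represented by the integer i (1-based).
    """
    chosen = set(layers)
    # A prefixed-subset with k elements must be exactly {1, ..., k}.
    return chosen == set(range(1, len(chosen) + 1))
```

Under this representation, {1, 2, 3} is a prefixed-subset while {1, 3} is not, because layer 2 is missing and layer 3 could not be decoded.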

Suppose S' is a prefixed-subset of L to be constructed, the allocation goal can 
be formalized as 



Maximize $|S'|$    (5)

Subject to

$\sum_{j=0}^{n} x_{i,j} = 1, \quad \forall i : l_i \in S'$    (6)

$\sum_{i : l_i \in S'} x_{i,j} \, r_i \le b(p_j, p), \quad j = 0, 1, \ldots, n$    (7)

$x_{i,j} \cdot i \le a_j, \quad i = 1, 2, \ldots, m, \; j = 1, 2, \ldots, n$    (8)



Formulation (5) indicates the goal is to maximize the size of S', which means to
maximize the receiver’s streaming quality by optimally allocating layers to the
candidate supplying node set $P \cup \{p_0\}$. Equation (6) indicates each layer in S'
must be allocated to a node of $P \cup \{p_0\}$. The meanings of formulations (7)
and (8) are similar to the meanings of formulations (3) and (4).

Theorem 1. The optimal layer allocation problem in the scenario of insufficient 
is NP-Hard. 

Proof: Consider a decision problem that corresponds to the optimal layer
allocation problem in the scenario of insufficient. Regard any prefixed-subset
$L' = \{l_1, l_2, \ldots, l_k\}$ ($k \le m$) of L as an item set, where each item $l_i$ is associated with
a size $r_i$; regard the candidate supplier set $P \cup \{p_0\}$ as a bin set, where each bin
$p_j$ ($0 \le j \le n$) is associated with a capacity $b(p_j, p)$; for simplicity let $a_1 =
a_2 = \ldots = a_n = m$, which means any item is allowed to be placed into any bin.
Then the decision problem can be described as: can all the items of L' be
put into the bin set $P \cup \{p_0\}$ such that the sum size of the item(s) placed into
each bin does not exceed the bin’s capacity? As the 3-partition problem can be
regarded as a special case of this decision problem and the 3-partition problem
is NP-Complete [6], the above decision problem is NP-Complete, and the
corresponding optimal layer allocation problem is NP-Hard.

Here we propose a heuristic algorithm (Fig. 2) for this optimization problem.
Its main idea is to allocate the layers from the base layer to the enhancement
layers in turn and to allocate each layer to the most suitable supplier. The algorithm’s
complexity is $O(mn)$.



let U ← ∅, L(p_j) ← ∅ (0 ≤ j ≤ n);
for i ← 1 to m do
    for j ← 0 to n do
        if a_j ≥ i and b(p_j, p) ≥ r_i then U ← U + {p_j};
    end for
    if |U| = 0 then
        return i − 1 and exit; /* i − 1 is the number of allocated layers */
    if |U| = 1 then /* suppose U equals {p_j} */
        L(p_j) ← L(p_j) + {l_i}, b(p_j, p) ← b(p_j, p) − r_i;
    if |U| > 1 then
        select the supplier(s) with the least available bandwidth to p;
        if multiple such suppliers exist, then
            select a supplier which cached the least number of the layers;
        if the selected supplier is p_j then
            L(p_j) ← L(p_j) + {l_i}, b(p_j, p) ← b(p_j, p) − r_i;
    let U ← ∅;
end for



Fig. 2. The heuristic algorithm to maximize a receiver’s streaming quality 
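
The heuristic in Fig. 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: suppliers are indexed 0..n with index 0 standing for the media server, `cached[j]` plays the role of $a_j$, `bandwidth[j]` plays the role of $b(p_j, p)$, and the function name `allocate_layers` is hypothetical.

```python
from typing import Dict, List, Tuple


def allocate_layers(
    rates: List[float],      # rates[i-1] is r_i, the rate of layer l_i
    cached: List[int],       # cached[j] is a_j: supplier p_j holds layers 1..a_j
    bandwidth: List[float],  # bandwidth[j] is b(p_j, p); index 0 is the server p_0
) -> Tuple[int, Dict[int, List[int]]]:
    """Greedy sketch of Fig. 2: returns (number of allocated layers, L(p_j))."""
    remaining = list(bandwidth)  # work on a copy; the caller's list is untouched
    assignment: Dict[int, List[int]] = {j: [] for j in range(len(bandwidth))}

    for i, rate in enumerate(rates, start=1):
        # Candidate set U: suppliers that hold layer i and can still afford r_i.
        candidates = [
            j for j in range(len(remaining))
            if cached[j] >= i and remaining[j] >= rate
        ]
        if not candidates:
            return i - 1, assignment  # layers 1..i-1 were allocated
        # Least available bandwidth first; ties broken by fewest cached layers.
        chosen = min(candidates, key=lambda j: (remaining[j], cached[j]))
        assignment[chosen].append(i)
        remaining[chosen] -= rate

    return len(rates), assignment
```

Each layer scans all suppliers once, which matches the stated $O(mn)$ complexity. Preferring the supplier with the least available bandwidth keeps larger bandwidth budgets free for later, possibly higher-rate, layers.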



2.3 Fault-Tolerance Mechanism

Node departures can happen frequently in a peer-to-peer community. Our fault-
tolerance mechanism adapts to this as follows: when a
receiving node detects that a supplying node has ceased to provide the media
data, it sends a request to the media server and tries to get the missing layers
from it; if the receiving node’s request can be satisfied by the media server,
it gets the missing layers from the media server; otherwise, the receiving node
executes the algorithm described in Section 2.2 to
reallocate the layers to the candidate suppliers and thus tries to maximize its
streaming quality.



3 Simulation Results 

In this section, we present simulation results to evaluate the performance of
the proposed algorithms. A 6-layered encoding mode whose layer rates range from
64 kbps to 192 kbps and a 10-layered encoding mode whose layer rates range from
64 kbps to 128 kbps are used in the simulation, both with a sum rate of 768 kbps
(denoted as $R_0$). The topology used in the simulation has three levels: the first
two levels, generated using the GT-ITM Generator [5], contain 400
routers, and 1000 host nodes are attached to the stub routers as the lowest level.
The media server is attached to a transit router. The link bandwidths of the first
level are set to 100M, and those of the second level are chosen in the range
[1M, 10M]. The available inbound bandwidths of nodes are distributed in the range
[128K, 1M], and the outbound bandwidth that each node is willing to contribute is
distributed in the range [$0.1R_0$, $0.6R_0$], which reflects the diversity in the P2P
community [4].

Since [2] proposed a greedy layer allocation algorithm with a goal and
assumptions similar to those of the algorithm presented in Section 2.1, we
compare the performance of the two algorithms in the simulation. We name the
algorithm in [2] GEBALAM (GrEedy Based Approximation Layer Allocation
algorithM) and ours FABALAM (Flow Adjustment Based Approximation
Layer Allocation algorithM).

An experiment is simulated as follows, after first setting a lower-limit: each
node joins the system by submitting a request to the server in random order;
the server constructs a candidate supplying peer set P in response to each
request, according to the statuses of the peers that have joined, where each peer in P can
supply at least one expected layer of the requesting node; if the size of P is larger
than the lower-limit setting, then P and $p_0$ are returned to the requesting
node, and one of the algorithms presented in this paper is executed on
the requesting node side according to the applicable scenario; otherwise only $p_0$ is
returned and the requesting node begins to get data from $p_0$; each node
becomes a peer node when it begins to receive the media data; after all nodes have
joined the system, we repeat the above procedure but substitute the algorithm
GEBALAM for FABALAM. Each experiment is performed 10 times, and the
simulation results are averaged over all runs.




(a) Performance comparison on saving the server's bandwidth consumption
(b) Performance on maximizing the streaming qualities of receiving nodes

Fig. 3. Performance of the proposed algorithms



Fig. 3(a) compares the two algorithms on saving the server’s
bandwidth consumption, with the lower-limit set to 6 and the 10-
layered encoding mode. The x axis denotes the number of nodes that have joined
the system. The y axis denotes the ratio of the server’s bandwidth consumption
to the sum of the bandwidths consumed by the joined nodes. It shows that
our proposed algorithm saves more of the media server’s bandwidth (2%~3%)
than the corresponding greedy algorithm proposed in [2].

Define the metric streaming quality satisfaction as the ratio of the sum rate of
the layers that a node actually receives to the sum rate of its expected layers. We
select those allocation samples in each experiment where the available bandwidth
between a receiving node and the media server is insufficient, and let the average
streaming quality satisfaction of all the samples denote the performance of the
algorithm presented in Section 2.2. From Fig. 3(b) it can be seen that the
streaming qualities of these receiving nodes stay at high levels. Furthermore, the
larger the lower-limit setting, the higher the streaming qualities. This is because
the larger the number of candidate supplying peer nodes, the more likely it is
that a receiving node can receive more expected layers.
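
As a small illustration of the metric defined above, the sketch below computes streaming quality satisfaction for one node. The function name and the example rates are hypothetical; only the ratio itself comes from the text.

```python
from typing import Sequence


def streaming_quality_satisfaction(
    received_rates: Sequence[float],
    expected_rates: Sequence[float],
) -> float:
    """Sum rate of the received layers divided by sum rate of the expected layers."""
    expected_total = sum(expected_rates)
    if expected_total <= 0:
        raise ValueError("expected layers must have a positive sum rate")
    return sum(received_rates) / expected_total
```

For instance, a node that expects layers with rates 64, 64 and 128 (illustrative values) but receives only the first two obtains a satisfaction of 128 / 256 = 0.5.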

4 Conclusions 

This paper presented a layer allocation framework for layered peer-to-peer
streaming. Considering the constraints of the supplying peer nodes and the increasing
trend of the server’s bandwidth consumption, this framework comprises two
algorithms, chosen according to the available bandwidth between
the media server and a receiving node, to reach one or both of the goals:
1) minimize the resource consumption of the media server, 2) maximize the
receiving node’s streaming quality. Simulation studies show the efficiency of the
layer allocation framework. Our fault-tolerance mechanism is an initial result of
the work; we plan to explore more mechanisms, such as buffer management
and standby peer selection, as possible directions of our
future work.

Abstract. While Web Services already provide distributed operation execution,
the registration and discovery with UDDI is still based on a centralized 
repository. In this paper we propose a distributed XML repository, based on a 
Peer-to-Peer infrastructure called pXRepository for Web Service discovery. In 
pXRepository, the service descriptions are managed in a completely 
decentralized way. Moreover, since the basic Peer-to-Peer routing algorithm 
cannot be applied directly in the service discovery process, we extend the basic 
Peer-to-Peer routing algorithm with XML support, which enables pXRepository 
to support XPath-based composite queries. Experimental results show that 
pXRepository has good robustness and scalability. 



1. Introduction

Web services are much more loosely coupled than traditional distributed applications.
Current Web Service discovery employs a centralized repository such as UDDI[1],
which leads to a single point of failure and a performance bottleneck. The repository is
critical to the ultimate utility of Web Services and must support scalable, flexible
and robust discovery mechanisms. Since Web services are widely deployed on a huge
number of machines across the Internet, a decentralized repository is needed to
manage them.

Peer-to-peer, as a completely distributed computing model, could supply a good
scheme to build the decentralized repository for Web Service discovery. Existing
Peer-to-Peer systems such as CFS[3] and PAST[4] seek to take advantage of the rapid
growth of resources to provide inexpensive, highly available storage without
centralized servers. However, because Web Services utilize XML-based open
standards, such as WSDL for service definition and SOAP for service invocation,
directly importing these systems by treating XML documents as common files will
make Web Service discovery inefficient. INS/Twine[5] seems to provide a good
solution for building the Peer-to-Peer XML repository. However, INS/Twine does not
support XPath-like queries.

We designed a decentralized XML repository for Web service discovery based on a
structured Peer-to-Peer network, named pXRepository (Peer-to-Peer XML
Repository). Unlike Twine, we allow index keys to be tree-structured or non-
prefix sub-keys. For improved scalability, index entries are further organized
hierarchically. We have extended the Peer-to-Peer routing algorithm based on
Chord[2] to support XPath composite queries in pXRepository. We name this
algorithm eXChord (extended XML based Chord). Experimental results have shown
that pXRepository has good scalability and robustness.



2. System Overview

pXRepository is a Peer-to-Peer XML storage facility. Each peer in pXRepository acts
as a service peer (SP for simplicity), which not only provides Web service access, but
also acts as a peer in the Peer-to-Peer XML storage overlay network. The architecture
of the service peer in pXRepository is shown in Fig. 1.




Fig. 1. The architecture of the service peer 



In pXRepository, XPath[6] is used as the query language for retrieving XML
documents stored over the Peer-to-Peer storage network. Each SP consists of three
active components, called the Web Service Discovery Interface, the core component
and the router, and a passive component called the local repository. The Web Service
Discovery Interface provides an access interface to publish or locate Web services and
also exposes itself as a Web service. The service description resolver is responsible
for extracting key nodes from a description. Each key node extracted from the
description is independently passed to the service key mapper component, together
with the service description or query. The service key mapper is responsible for
associating an HID (Hash ID) with each key node. It does this by hashing the key node.
More details are given in Section 3. The Query Resolver and the Query Key Mapper
work almost the same way as the Service Description Resolver and the Service Key
Mapper, except that the Query Resolver generates the query string based on the parsing
tree, which is produced by the XPath parser. The service mapper is responsible for mapping
HIDs to service descriptions and returns the results to the application services
through the Web Service Discovery Interface. The local repository keeps the Web service
interface, service descriptions and HIDs that the SP is responsible for. The router routes
query requests and returns routing results.

In pXRepository, we organize all service peers in a structured Peer-to-Peer
overlay network. Because Chord offers simplicity, provable correctness,
and provable performance compared with other lookup protocols, we use the Chord
protocol to organize the SP’s routing table.



3. Web Service Discovery in pXRepository

The service locating algorithm specifies how to route the query to the service peer that
satisfies the service request. In pXRepository, the service request is expressed in
XPath. However, the routing algorithm, Chord, in the underlying Peer-to-Peer overlay
network only supports exact match. We extend the Chord algorithm to support XPath
based match. The extended Chord algorithm is called eXChord.

In pXRepository, WSDL is used to describe the Web service interface, and the
service description metadata is generated based on the content of the WSDL document
and the description that the user inputs before publishing. An example of Web service
description metadata is shown in Fig. 2.



<services><service>
<name>ListPriceService</name>
<documentation>List the product price</documentation>
<location>http://services.companya.com/product/ListProductService.wsdl</location></service>
<service><name>OrderService</name>
<documentation>Make order to product</documentation>
<location>http://services.companya.com/product/OrderService.wsdl</location></service></services>
<description><company>CompanyA</company>
<industry>Manufactory</industry><region>China</region>
<keyword>Automobile Price Order</keyword>
<comments> </comments>
</description>



Fig. 2. An example of Web service description metadata in pXRepository 



Fig. 3. A sample NVTree converted from the service description shown in Fig. 2.

To publish the Web Services, the Web Service description metadata is first
converted to a canonical form: a node-value tree (NVTree). Fig. 3 shows an example
of the NVTree converted from the Web Service description shown in Fig. 2.

pXRepository extracts each node from the NVTree. Fig. 4 shows the concatenated
strings produced from the left sub-tree in Fig. 3. Each node in Fig. 4 is associated with
a string called the node key (denoted by S1, S2, ...). The node key is formed by
concatenating the child node keys and the node's own value, with a slash (/) between them. If the node has
multiple child nodes, each child node key is enclosed in brackets and the keys are concatenated
in left-to-right order. The concatenation is recursive and follows post-order tree
traversal. The right sub-tree in Fig. 3 is processed in the same way.
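
The node key rule described above can be sketched as a short recursive function. This is a minimal illustration, assuming a simple in-memory tree; the class `NVNode` and function `node_key` are hypothetical names, and the lower-cased keyword values are taken from the strings listed in Fig. 4.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NVNode:
    """One node of a node-value tree (hypothetical in-memory representation)."""
    value: str
    children: List["NVNode"] = field(default_factory=list)


def node_key(node: NVNode) -> str:
    """Build the node key by post-order traversal, following the rule in the text."""
    if not node.children:
        return node.value  # a leaf's key is just its value
    child_keys = [node_key(child) for child in node.children]
    if len(child_keys) == 1:
        prefix = child_keys[0]
    else:
        # Multiple children: bracket each child key, left to right.
        prefix = "".join(f"[{key}]" for key in child_keys)
    return f"{prefix}/{node.value}"
```

Applied to the first `service` element of Fig. 2, with `name` holding `listprice` and `documentation` holding `product` and `price`, the function yields the string listed as S6 in Fig. 4.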



S1    listprice
S2    listprice/name
S3    product
S4    price
S5    [product][price]/documentation
S6    [listprice/name][[product][price]/documentation]/service
S7    orderservice
S8    orderservice/name
S9    order
S10   product
S11   [order][product]/documentation
S12   [orderservice/name][[order][product]/documentation]/service
S13   [S6][S12]/services




Fig. 4. Splitting a NVTree into service description key strings 



Each node key is passed to the hash function to produce an HID, which is used
as the key to insert into the underlying Peer-to-Peer overlay network. In pXRepository,
the hierarchical relationship of the nodes in the NVTree is preserved, as shown in
Fig. 5. Each element in Fig. 5 resides on a specific service peer in pXRepository
corresponding to the hash value of its node key (denoted as h(S1), h(S2), ...).



[Figure: each node key is stored as a record of the form
| h(Si) | Si | URI | h(parent node key) |. The root record is
| h(S13) | S13 | URI | NULL |; for example, S6 is stored as
| h(S6) | S6 | URI | h(S13) | and S2 as | h(S2) | S2 | URI | h(S6) |.]

Fig. 5. Distributing node keys across pXRepository



Before presenting the pXRepository service publishing and locating algorithm,
named eXChord, we first introduce some definitions:

Definition 1. Let SD stand for a Web Service description document. Then U(SD)
stands for the URI of the document, P(SD) represents the NVTree of SD, and N(SD)
stands for the set of NVTree nodes, where $N(SD) = \{N_1, N_2, \ldots\}$.

Definition 2. Let N stand for an NVTree node. Then P(N) stands for its parent node
and K(N) represents the node key.

The pseudocode of the eXChord service description publishing algorithm is given in
Fig. 6. Function Publish runs on peer n, takes a service description (SD) as input and
publishes the SD into the Peer-to-Peer overlay network.



n.Publish(SD) {
    key = hash(U(SD)); n' = n.Route(key); // Chord routing algorithm
    n'.Insert(key, SD); Compute N(SD) = {N_1, N_2, ...};
    for each N_i in N(SD) {
        nodekey = K(N_i); parentkey = K(P(N_i));
        n.Distribute(nodekey, parentkey, SD); }
}
n.Distribute(nk, pk, SD) {
    id = hash(nk); n' = n.Route(id);
    n'.Insert(id, nk, U(SD), hash(pk));
}

Fig. 6. The pseudocode of eXChord service description publishing algorithm 
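
The publishing step of Fig. 6 can be mimicked with a toy, single-process stand-in for the overlay. Everything below is an assumption made for illustration: `ToyRing` replaces real Chord routing with a sorted list of peer identifiers, `hash_id` uses SHA-1 truncated to a 32-bit identifier space, and the record layout follows the `Insert(id, nk, U(SD), hash(pk))` call in the figure.

```python
import hashlib
from bisect import bisect_left
from typing import Dict, List, Optional, Tuple

ID_BITS = 32  # hypothetical identifier-space size, chosen only for this example


def hash_id(text: str) -> int:
    """Map a string to an identifier (assumption: SHA-1 truncated to ID_BITS)."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << ID_BITS)


class ToyRing:
    """Stand-in for the Chord overlay: route(key) returns the successor peer id."""

    def __init__(self, peer_ids: List[int]) -> None:
        self.peer_ids = sorted(peer_ids)
        # store[peer_id][key] is the list of records kept on that peer
        self.store: Dict[int, Dict[int, list]] = {p: {} for p in self.peer_ids}

    def route(self, key: int) -> int:
        index = bisect_left(self.peer_ids, key)
        return self.peer_ids[index % len(self.peer_ids)]  # wrap around the ring

    def insert(self, key: int, record) -> None:
        self.store[self.route(key)].setdefault(key, []).append(record)

    def get(self, key: int) -> list:
        return self.store[self.route(key)].get(key, [])


def publish(
    ring: ToyRing,
    uri: str,
    document: str,
    keyed_nodes: List[Tuple[str, Optional[str]]],
) -> None:
    """keyed_nodes holds (node key, parent node key or None for the root) pairs."""
    ring.insert(hash_id(uri), ("document", document))
    for key_string, parent_key in keyed_nodes:
        parent_hash = hash_id(parent_key) if parent_key is not None else None
        ring.insert(hash_id(key_string), (key_string, uri, parent_hash))
```

Storing the parent's hash with every record is what lets a lookup that starts at a leaf key walk upward toward the root, as the Match step of the locating algorithm requires.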



Let R = ∅ // R is the result set of the query
n.Locate(QD) {
    Compute N(QD) = {N_1, N_2, ...};
    for each N_i in N(QD) {
        key = hash(K(N_i)); n' = n.Route(key);
        NV = n'.get(key); /* NV represents the set of nodes having the
                             node key value of K(N_i) */
        for each NV_j in NV {
            SD = n'.Match(NV_j, QD); /* Match is a recursive process finding
                                        the matching document set */
            if SD != NULL then R = R ∪ SD; }
    }
}
n.Match(N, QD) {
    status = IsMatch(N, QD); /* status can be Yes, No, or Not Sure */
    if status = Yes then
        return SD; // SD is the service description document that matches QD
    elseif status = No then return NULL;
    else {
        key = hash(K(P(N))); // get parent node key hash value
        n' = n.Route(key); NV = n'.get(key);
        for each NV_j in NV {
            SD = n'.Match(NV_j, QD);
            if SD != NULL then R = R ∪ SD;
        }
    }
}



Fig. 7. The pseudocode of eXChord service locating algorithm. 



To search for a Web Service, the client must specify the query requirement, which
is expressed in the XPath language. pXRepository supports composite XPath queries, and
each XPath query only contains text matching constraints. An XPath query can be
converted to a tree called an XPTree. Because each node key in the NVTree preserves the
sub-tree structure information, the Web Service searching process can be treated as a sub-tree
matching problem.

To present the Web Service locating algorithm, we first introduce some definitions:
Definition 3. Let QD stand for an XPath query. Then L(QD) represents the XPTree
of QD, and N(QD) stands for the set of leaf nodes, where $N(QD) = \{N_1, N_2, \ldots, N_m\}$.
Definition 4. Let N stand for an XPTree node. Then K(N) represents the value of
the node.

The pseudocode of the eXChord service locating algorithm is given in Fig. 7. Function
Locate runs on peer n, takes the query requirement QD as its input and searches the Peer-
to-Peer overlay network for the services that satisfy its requirements.



4. Evaluation and Experimental Results 

In this section, we evaluate pXRepository by simulation and compare it
with a centralized service management approach such as UDDI. We compare the latency,
space overhead, load, and robustness of pXRepository with those of UDDI.

We use the Georgia Tech Internetwork Topological Models (GT-ITM)[7] to
generate the network topologies used in our simulations. We use the “transit-stub”
model to obtain topologies that more closely resemble the Internet hierarchy than a pure
random graph. An internetwork with 600 routers and 28800 service peers (nodes for
simplicity) is used in our experiment.



4.1 Latency 

We measure latency as the number of hops in the network. Fig. 8 shows
the effect of the number of nodes on latency. The routing table of pXRepository is the
same as that of Chord, so the number of logical hops grows logarithmically with
the number of nodes. If the average latency of a single logical hop is $K$, the
latency of pXRepository is $Latency_{pXRepository} = K \times \log(N_h)$.

Experimental results further show that the latency of UDDI is roughly 80. This is
because the number of logical hops of UDDI is 2, independent of $N_h$. Although the
latency of pXRepository is higher than that of UDDI, pXRepository has lower space
overhead, lower load, and good robustness (see Sections 4.2, 4.3, and 4.4).



4.2 Space Overhead 

To simplify the analysis, we first give a definition and two assumptions:
Definition 5. The memory size of a routing ID and its corresponding IP is called a
memory unit, the size of which is $\sigma$.

Assumption 1. Node IDs and Web service description IDs are distributed uniformly
in the ID space, and there are n service peers in the system; each peer publishes m service
description documents on average. For each service description, s keys are
generated by extracting the concatenated strings from the service description.
Assumption 2. The average overhead of each service description item is $k \times \sigma$.
The overhead of pXRepository is:

$Space_{pXRepository} = \log(n) \times \sigma + m \times s \times k \times \sigma$    (1)

Since UDDI maintains all information in a central repository, the overhead of UDDI is:
$Space_{UDDI} = n \times m \times k \times \sigma$    (2)
For m = 80, k = 2, s = 10 and $\sigma$ = 200, the effect of the number of nodes $N_h$ on space
overhead is shown in Fig. 9. The results in Fig. 9 show that the space
overhead of pXRepository is much lower than that of UDDI.
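
Formulas (1) and (2) can be evaluated directly for the parameter values quoted above. The sketch below is illustrative only; the function names are hypothetical, and the base-2 logarithm is an assumption (the text writes log(n) without a base, and base 2 matches the size of a Chord routing table).

```python
import math


def space_pxrepository(n: int, m: int, s: int, k: int, sigma: int) -> float:
    """Formula (1): per-peer overhead; log base 2 is an assumption."""
    return math.log2(n) * sigma + m * s * k * sigma


def space_uddi(n: int, m: int, k: int, sigma: int) -> int:
    """Formula (2): overhead of the central UDDI repository."""
    return n * m * k * sigma
```

With m = 80, k = 2, s = 10 and $\sigma$ = 200, the second term of formula (1) is the constant 320000, so the per-peer overhead grows only with log(n), while formula (2) grows linearly in n; for n = 28800 it gives 921,600,000.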
Fig. 10. Load of pXRepository and UDDI



Fig. 11. The effect of node failure 



4.3 Load 

Load is an important metric for evaluating a Web service management approach. This
paper uses the number of messages in and out of a node as the metric for
load. We assume that each service lookup query generates 3 keys on average, and
that each key is used to locate the service. Fig. 10 compares the loads
of pXRepository and UDDI. The load of UDDI increases linearly with the
number of nodes, whereas the load of pXRepository grows logarithmically with the
number of nodes.



4.4 Robustness 

In this experiment, we consider 28800 nodes, with each node distributing 10
service descriptions, and each service description generating 10 keys. We also
assume that each service query requirement issued by a client user generates 2 keys
covered by the service description keys. Then we randomly select a fraction of the
nodes to fail. After the failures occur, we wait for the network to stabilize, and
then measure the fraction of the keys that cannot be located correctly. Fig. 11 plots
the effect of node failure on service lookup.

Since, in the eXChord algorithm used in pXRepository, a copy of the service
description is distributed in the underlying Peer-to-Peer overlay
network for each service key, the client can still locate the appropriate service description by
using other keys when some service peers fail.



5. Conclusions 

In this paper, we proposed a distributed XML repository, based on a Peer-to-Peer
infrastructure and called pXRepository, for Web Service discovery. In pXRepository,
Web service descriptions are managed in a totally distributed way, which avoids a
single point of failure and is scalable and robust. We also extended the basic Peer-
to-Peer routing algorithm with XPath support.

Abstract. Hint-based Locating & Routing Mechanism (HBLR) derives from
the locating & routing mechanism in Freenet. HBLR uses file location hint to 
enhance the performance of file searching and downloading. In comparison 
with its ancestor, HBLR saves storage space and reduces file request latency. 
However, because of the inherent fallibility of hint, employing location hint 
naively for file locating in P2P file-sharing system will lead to under-expectant 
performance. In this paper, hint’s uncertainty and related bad results are ana- 
lyzed. According to the causation, pertinent countermeasures including credible 
hint, master-copy transfer, and file existence pre-detection are proposed. Simu- 
lation shows the performance of HBLR is improved by adopting the proposed 
policies. 



1 Introduction 

The Hint-based Locating & Routing Mechanism (HBLR) [3] is based on the document
routing model [1] in Freenet [2]. It searches for file location hints instead
of searching for files directly, and distributes file location hints instead of the files themselves.
Compared to Freenet, HBLR has two obvious advantages: it saves disk space on
peers, and it reduces file transfer time by downloading the file from a selected peer.

However, distributing hints instead of files also has its drawbacks. Several
problems and their proposed solutions follow:

1. Stale hint set. If all hints that a requestor receives are stale, the request will
certainly fail. However, the file may still exist on some peers, most probably on the file’s
publisher. So, employing the publisher as a credible file holder, and keeping the
corresponding hint on peers, can provide at least one downloading source.

2. Credible file holder absence. If a publisher encounters a space shortage, it may
delete previously published files. Moving the file to another peer, and maintaining
the necessary links between the old and new holders, can resolve the problem.

3. Repetitious downloading attempts. If the hints pointing to some relatively better
positions are all stale, the requestor will fail to download several times until it reaches
a position whose hint is correct. This is inefficient, and a file existence pre-detection
mechanism is needed.

The rest of this paper is organized as follows: Section 2 describes the proposed
mechanisms. Section 3 gives the simulation and discussion. Section 4 gives some related
work. Finally, the authors’ work is summarized in Section 5.



2 Improving Hints’ Accuracy in HBLR 

The searching and distribution methods for files in Freenet and for hints in HBLR are
the same. Therefore, their searching success ratios are almost equal too. If the hints’
accuracy can be ensured, HBLR’s performance can be improved.



2.1 Credible Hint and Master-Copy 

For a given file, if (1) there is a credible file holder, (2) the hint about this holder is kept
on all the peers that have hint entries about the given file, and (3) the hint about
the credible holder is always transferred to the requestor, then HBLR’s request
success ratio can be improved to Freenet’s level.

A publisher, which provides the file for sharing, will keep the file as long as possible.
So, the publisher is a quite credible holder, and condition (1) can be satisfied. In the finite-hint
[3] solution, all hint items are transferred back naturally. In the full-hint [3] solution,
as long as a special policy is adopted that ensures the publisher-related hint is
transferred back, condition (3) can be satisfied. A peer in HBLR obtains hints during three
procedures: (a) publishing; (b) hint transfer in a request; (c) updating after file
downloading. In any of these cases, so long as condition (3) is satisfied, the holder-
related hint is always kept. So, condition (2) can be satisfied.

We call the publisher the credible holder, and the publisher-related hint the credible hint.
The copy of a file on its publisher is called the master-copy.



2.2 Master-Copy Transfer 

The file’s publisher is a quite credible holder, but not a completely credible holder.
When it encounters a space shortage, it may delete a previously published file. In that
case the publisher should transfer the master-copy to a neighbor peer. The criteria for target
peer selection include free space and communication latency. The publisher also
retains a link pointing to the new holder for master-copy access. The new holder is
likewise able to transfer the file when facing a space shortage.



2.3 File Existence Pre-Detection 

After receiving hints, the requestor selects the best peer for downloading. But if the file
on the best peer was deleted, the downloading attempt fails, and reselection is needed.
Though with the credible hint and master-copy transfer mechanisms the file will be
obtained in the end, repetitious attempts are inefficient. File existence pre-detection
should be included in the selection step, so that file existence becomes a precondition.
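
The selection step with pre-detection can be sketched as follows. This is a hedged illustration rather than HBLR's actual procedure: the hint shape `(peer, cost)`, the `file_exists` probe callback, and the fallback to a credible holder are all assumptions introduced for the example, since the text does not specify how peers are ranked or probed.

```python
from typing import Callable, Iterable, Optional, Tuple

Hint = Tuple[str, float]  # (peer id, estimated cost such as latency); hypothetical shape


def select_download_peer(
    hints: Iterable[Hint],
    file_exists: Callable[[str], bool],   # hypothetical probe sent before downloading
    credible_peer: Optional[str] = None,  # publisher or current master-copy holder
) -> Optional[str]:
    """Pick the lowest-cost peer that confirms it still holds the file."""
    for peer, _cost in sorted(hints, key=lambda hint: hint[1]):
        if file_exists(peer):
            return peer  # existence confirmed, so the download should not fail
    # Every hint was stale: fall back to the credible holder, if one is known.
    if credible_peer is not None and file_exists(credible_peer):
        return credible_peer
    return None
```

A lightweight existence probe replaces a failed download attempt per stale hint, which is the inefficiency that pre-detection is meant to remove.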
3 Simulations

The success ratio, space usage, and average file service time were measured in the
simulations. 100 peers and 4000 different files were involved. The files’ average size was
variable, and each peer contributed 500 Mbytes of storage space to Freenet or HBLR.




Fig. 1. Success ratio of file requests

Fig. 2. Space usage

Fig. 3. File service time

Success ratios of requests were compared first. Both systems processed ten
consecutive request sequences, each containing 1000 requests. The files’ average size
was set to 2, 2.5 and 3 Mbytes. As Fig. 1 shows, HBLR achieved a success ratio
close to that of Freenet. When the file size became bigger, HBLR’s success ratio
exceeded that of Freenet in the later sequences. The cause was that Freenet deleted some
cached copies to make space for newly arriving files, which reduced copy distribution.

Space usage was also examined. The simulation process included 3000 file
requests. The files’ average size was varied from 1 Mbyte to 5 Mbytes. As Fig. 2
shows, Freenet used much more space than HBLR, especially with bigger files.

As Fig. 3 shows, HBLR paid for its relatively complex operation: its
average file service time was worse than Freenet’s with smaller files. But it did
better when files were bigger, because the bigger the files were, the fewer copies were
distributed in Freenet.

According to the simulations, HBLR with the credible hint and master-copy
transfer mechanisms behaved well.



4 Related Work 

This improvement work is based on HBLR [3], which derives from Freenet [2]. The
basic locating model in both systems is the document routing model [1].

Cooperation is an essential characteristic and requirement in P2P. The experience of other
cooperative systems can be referenced when building P2P cooperative mechanisms. P.
Sarkar and J. H. Hartman proposed a cooperative caching mechanism using master
copies and hints [4]. Michael D. Dahlin et al. developed another cooperative caching
model to improve distributed file system performance [5]. In their models, useful
copies are forwarded to other places when facing local deletion.



5 Conclusions and Future Work 

In this paper, credible hint and master-copy transferring mechanisms were introduced
to resolve the hint uncertainty problem in HBLR. These new elements worked
effectively in HBLR and improved its performance.

HBLR is an ongoing project, and some work remains to be done. The credible hint is
a minimal step toward improving hint accuracy. A more advanced hint management
mechanism is still needed.
Abstract. As audio and video applications have proliferated on the
Internet, transcoding proxy caching is attracting an increasing amount
of attention, especially in the environment of mobile appliances. Since
cache replacement and consistency algorithms play a central role in the
functionality of transcoding proxy caching, it is a practical necessity
to account for both in transcoding cache design. In this paper, we
propose an original cache maintenance algorithm, which integrates both
cache replacement and consistency algorithms. Our algorithm also
explicitly accounts for the newly emerging factors in the transcoding
proxy. Specifically, we formulate a generalized cost saving function to
evaluate the profit of caching a multimedia object. Our algorithm evicts
objects based on the generalized cost saving of caching each object.
Consequently, the objects with less generalized cost saving are removed
from the cache. In addition, our algorithm considers the validation and
write rates of the objects, which are of considerable importance for a
cache maintenance algorithm. Finally, we evaluate our algorithm on
different performance metrics through extensive simulation experiments.
The results show that our algorithm outperforms comparison algorithms in
terms of the performance metrics considered.



Key words: Transcoding proxy caching, cache maintenance, cache replacement,
cache consistency, World Wide Web.

1 Introduction 

With the explosive growth of the World Wide Web, proxy caching has become 
an important technique to improve network performance [15,16]. Due to the 
limited cache space, it is impossible to store all the web objects in the cache. 
As a result, cache replacement algorithms [10,14,16] are used to determine a 
suitable subset of web objects to be removed from the cache to make room 
for a new web object. However, the improvement of network performance, such 
as access latency reduction achieved by caching web objects, does not come 
completely for free. In particular, maintaining content consistency with the
primary servers generates extra requests. Many proxy cache implementations
depend on a consistency algorithm to ensure a suitable form of consistency for the
cached documents. Cache consistency algorithms [1,3,11] are used to guarantee
the consistency of the cached web objects.

Transcoding is used to transform a multimedia object from one form to another,
frequently trading off object fidelity for size to suit the prevailing operating
environment. Since the transcoding proxy plays an important role in the
functionality of caching, transcoding proxy caching is attracting more and more
attention [4,7,9,12]. However, due to the newly emerging factors in the
environment of transcoding proxies, existing cache replacement and consistency
algorithms cannot simply be applied to solve the same problems for transcoding
proxy caching. In [5], the authors presented several examples to explain the
influence of these factors and explored the aggregate effect for efficient cache
replacement in transcoding proxies. However, they considered only the problem
of cache replacement and did not address cache consistency. We argue that cache
consistency has great influence on cache design. Consequently, it is a practical
necessity to address the problem of cache maintenance by including both the
cache replacement and consistency algorithms and the newly emerging factors in
the transcoding proxy. In this paper, we propose an original cache maintenance
algorithm for transcoding proxy caching, which integrates both the cache
replacement and consistency algorithms. Specifically,
we formulate a generalized cost saving function to evaluate the profit of caching 
a multimedia object. Our algorithm evicts the objects based on the generalized 
cost saving to fetch each object into the cache. Consequently, the objects with 
less generalized cost saving are to be removed from the cache. On the other hand, 
our algorithm also considers the validation and write rates of the objects, which 
is of considerable importance for a cache maintenance algorithm. We evaluate 
our algorithm on different performance metrics through extensive simulation ex- 
periments and compare our algorithm with other algorithms proposed in the 
literature. 

The remainder of this paper is structured as follows. We present a cache
maintenance algorithm for transcoding proxies in Section 2. The simulation model
and performance evaluation are described in Section 3 and Section 4, respectively.
Section 5 summarizes our work and concludes the paper.



2 A Cache Maintenance Algorithm in Transcoding 
Caching 

The relationship among different versions of a multimedia object can be
expressed by a weighted transcoding graph [5]. Let $o_{i,j}$ denote version $j$ of
object $i$, and let $\omega(i,j)$ be the transcoding cost from version $i$ to version
$j$. The reference rates to different versions of objects, denoted by $f_{i,j}$, are
assumed to be statistically independent, where $f_{i,j}$ is the mean reference rate
to version $j$ of object $i$. $\lambda_{i,j}$ is the read rate of version $j$ of
object $i$ from the server, $\mu_{i,j}$ is the write rate of version $j$ of object
$i$, $\eta_{i,j}$ is the cost of validating the consistency of version $j$ of object
$i$, and $p_{i,j}$ is the probability of invalidating the cached version $j$ of
object $i$.
First we calculate the cost saving from caching only one version of an object
in the transcoding cache (no other versions are cached). From the standpoint of
clients, an optimal cache replacement algorithm should maximize the cost saving
from caching multiple copies of objects by considering both the read cost and
the write cost. Thus, the individual cost saving of caching only $o_{i,j}$ is
defined as follows.

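Definition 1. $CS(o_{i,j})$ is a function for calculating the individual cost saving
of caching $o_{i,j}$, while no other versions of object $i$ are cached.

$$CS(o_{i,j}) = \sum_{x \in D(j)} \lambda_{i,x}\bigl(d_{i,x} + \omega(1,x) - \omega(j,x)\bigr) - \mu_{i,j}\, d_{i,j} \qquad (1)$$

where $D(j)$ is the set of versions that can be transcoded from version $j$.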

In Equation (1), $d_{i,x}$ is the cost of reading or writing $o_{i,x}$ from the
server and $\omega(1,x)$ is the cost of transcoding from the original version to
version $x$ if $o_{i,j}$ is not cached. On the other hand, $\omega(j,x)$ is the cost
of transcoding from version $j$ to version $x$, $\lambda_{i,x}$ is the read rate of
$o_{i,x}$ from the client, and $\mu_{i,j}$ is the write rate of $o_{i,j}$ from the
server if $o_{i,j}$ is cached.

As a matter of fact, there may be many versions of an object that can be 
cached at the same time if this is valuable. In the following we discuss the 
aggregate cost saving of caching multiple versions of an object. The aggregate 
cost saving of caching multiple versions of an object at the same time can be 
defined as below. 

Definition 2. $CS(o_{i,j_1}, o_{i,j_2}, \ldots, o_{i,j_k})$ is a function for
calculating the aggregate cost saving of caching $o_{i,j_1}, o_{i,j_2}, \ldots, o_{i,j_k}$.

C>S'(oiji , Oij'2 J ■ ■ ■ 5 Oi.ifc ) = ^ ^ ^ ' Xi^xidi^x 

xeD{y) ( 2 ) 

Tu^(l, x) Uj(y, x) Pi.ydi^x^ fXi,y^i,y^ 

Now we define the marginal cost saving of caching a version of object i if 
there is at least one version cached. 

Definition 3. $CS(o_{i,j} \mid o_{i,j_1}, o_{i,j_2}, \ldots, o_{i,j_k})$ is a function
for calculating the marginal cost saving of caching $o_{i,j}$, given that
$o_{i,j_1}, o_{i,j_2}, \ldots, o_{i,j_k}$ are already cached, where
$j \neq j_1, j_2, \ldots, j_k$.

$$CS(o_{i,j} \mid o_{i,j_1}, o_{i,j_2}, \ldots, o_{i,j_k}) = CS(o_{i,j}, o_{i,j_1}, o_{i,j_2}, \ldots, o_{i,j_k}) - CS(o_{i,j_1}, o_{i,j_2}, \ldots, o_{i,j_k}) \qquad (3)$$

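If we use $s_{i,j}$ to denote the size of $o_{i,j}$, then we formulate the
generalized cost saving function as follows:

$$CS_g(o_{i,j}) = \begin{cases} \dfrac{CS(o_{i,j})}{s_{i,j}} & \text{if no other versions are cached} \\[2ex] \dfrac{CS(o_{i,j} \mid o_{i,j_1}, o_{i,j_2}, \ldots, o_{i,j_k})}{s_{i,j}} & \text{if } o_{i,j_1}, o_{i,j_2}, \ldots, o_{i,j_k} \text{ are cached} \end{cases} \qquad (4)$$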
It is easy to see that the generalized cost saving function is normalized
by the size of $o_{i,j}$ to reflect the object size factor. The rationale behind
this normalization is to order the objects by the ratio of cost saving to object
size. The generalized cost saving function defined in Equation (4) explicitly takes
into consideration the newly emerging factors in transcoding caching and the
aggregate effect of caching multiple versions of an object. Importantly, it takes
into account not only the read cost but also the write cost.
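
As an illustration of how the marginal and generalized cost savings relate to the aggregate cost saving, the following minimal Python sketch may help. It treats the aggregate function of Definition 2 as a black box; the names `marginal_saving`, `generalized_saving`, and `toy_cs` are assumptions made for this example, and `toy_cs` is a made-up function with diminishing returns, not the cost model of this paper.

```python
from typing import Callable, FrozenSet

# An aggregate cost-saving function maps the set of cached version indices
# of one object to a number (Definition 2). Its exact form is left open here.
AggregateCS = Callable[[FrozenSet[int]], float]


def marginal_saving(cs: AggregateCS, j: int, cached: FrozenSet[int]) -> float:
    """Definition 3: saving added by caching version j on top of `cached`."""
    if j in cached:
        raise ValueError("version j must not already be cached")
    return cs(cached | {j}) - cs(cached)


def generalized_saving(cs: AggregateCS, j: int, cached: FrozenSet[int],
                       size: float) -> float:
    """Equation (4): individual or marginal saving, normalized by object size."""
    if not cached:
        saving = cs(frozenset({j}))  # no other version is cached
    else:
        saving = marginal_saving(cs, j, cached)
    return saving / size


def toy_cs(versions: FrozenSet[int]) -> float:
    """Hypothetical aggregate saving: later versions add less (illustration only)."""
    values = {1: 10.0, 2: 6.0, 3: 3.0}
    ordered = sorted((values[v] for v in versions), reverse=True)
    return sum(v / (rank + 1) for rank, v in enumerate(ordered))
```

With this toy function, caching version 2 alone is worth 6.0, but only 3.0 once version 1 is already cached, which is the aggregate effect that the normalization in Equation (4) then divides by the object size.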

Based on this function, we propose our cache replacement scheme as follows.
Suppose the size of a new object to be cached is $s$. Then we should find a subset
of objects $O^* = \{o_{f_1,g_1}, o_{f_2,g_2}, \ldots, o_{f_t,g_t}\} \subseteq O$ that
satisfies the following conditions. Here
$O = \{o_{1,1}, o_{1,2}, \ldots, o_{1,l_1}, o_{2,1}, o_{2,2}, \ldots, o_{2,l_2}, \ldots, o_{m,1}, o_{m,2}, \ldots, o_{m,l_m}\}$
is the set of objects cached.

(1) $\sum_{o_{f,g} \in O^*} s_{f,g} \geq s$

(2) $\sum_{o_{f,g} \in O^*} CS_g(o_{f,g}) \leq \sum_{o_{f,g} \in O'} CS_g(o_{f,g})$, $\forall\, O' \subseteq O$ that satisfies (1)

Obviously, (1) is to make enough room for the new object, and (2) is to evict
those objects whose total cost saving is minimized.

With the two conditions above, we can devise the pseudocode of our scheme
as follows.

Algorithm GCS(C, S_c, S_u, o_{i,j})

    add o_{i,j} into C
    recalculate the generalized cost saving of each version of object i
    BuildHeap(C)
    while S_c - S_u < s do
        remove the first object o_{f,g} from C
        S_u = S_u - s_{f,g}
        recalculate the generalized cost saving of each version of object f
        BuildHeap(C)

In Algorithm GCS, $C$ is used to hold the cached objects, $S_c$ is the cache
capacity, $S_u$ is the cache capacity used, and $o_{i,j}$ is the object to be cached.
For this algorithm, the most important step is to find the objects with minimal
cost saving.
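
The eviction loop can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: the function name `evict_for` and the dictionary layout are assumptions, and the generalized cost savings are taken as fixed inputs instead of being recalculated after every eviction as Algorithm GCS requires.

```python
import heapq
from typing import Dict, List, Tuple


def evict_for(cache: Dict[str, Tuple[float, float]], capacity: float,
              new_size: float) -> List[str]:
    """Evict lowest-saving objects until `new_size` fits; return evicted keys.

    `cache` maps an object key to (size, generalized cost saving).
    Simplification: savings are not recomputed between evictions.
    """
    used = sum(size for size, _ in cache.values())
    # Min-heap ordered by generalized cost saving, smallest first.
    heap = [(saving, key) for key, (_, saving) in cache.items()]
    heapq.heapify(heap)
    evicted = []
    while capacity - used < new_size and heap:
        _, key = heapq.heappop(heap)
        size, _ = cache.pop(key)
        used -= size
        evicted.append(key)
    return evicted
```

For example, with a capacity of 10, cached objects of sizes 4, 3, and 3, and a new object of size 5, the two objects with the smallest generalized cost saving are removed before the new object fits.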

It can be shown that the time complexity of Algorithm GCS is $O(S^2 \log S)$,
where S is the number of different objects cached. However, from the algorithm 
we know that we have to search the entire cache for the other versions of the 
object and then recalculate the generalized cost saving for them whenever we 
insert or evict an object into or from the cache. Such operations are, in general, 
very costly. Here we apply the data structure proposed in [5] to facilitate such 
operations. 

In the actual implementation, the parameters for computing the generalized 
cost saving are usually not constant. To realize our algorithm, these parameters 
may have to be relaxed. Here, we adopt a “sliding window” technique [15] which 




Cache Design for Transcoding Proxy Caching 191 



has been widely applied. It combines both the history data and the current value
to estimate the parameters. Specifically, the parameters are estimated as follows.

$$d_{i,j} = \alpha\, d_{i,j}^{new} + (1-\alpha)\, d_{i,j}^{old}, \qquad \lambda_{i,j} = \frac{K}{t_{i,j} - t_{i,j}^{K}}, \qquad \mu_{i,j} = \frac{K}{s_{i,j} - s_{i,j}^{K}}$$

where $d_{i,j}^{new}$ is the newly measured cost of reading or writing $o_{i,j}$ from
the client or the server and $d_{i,j}^{old}$ is the measured cost of reading or
writing $o_{i,j}$ from the client or the server last time; $t_{i,j}$ is the time when
the new request to $o_{i,j}$ is received from the client and $t_{i,j}^{K}$ is the time
when the $K$th most recent request was received from the client; $s_{i,j}$ is the
time when the new update to $o_{i,j}$ is sent from the server and $s_{i,j}^{K}$ is the
time when the $K$th most recent update was sent from the server. $\eta_{i,j}$ is
considered as a constant since it just sends an invalidation message to the server
for all the documents. We estimate $p_{i,j}$ by $\mu_{i,j}/(\lambda_{i,j}+\mu_{i,j})$.
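
A sliding-window estimator of this kind can be sketched in Python as follows. The sketch assumes that a rate is estimated as K divided by the time spanned by the last K events, and that a cost is blended from the newest and previous measurements with a weight `alpha`. Both forms, the default value of `alpha`, and the names `SlidingWindowRate` and `smooth` are assumptions made for illustration.

```python
from collections import deque


class SlidingWindowRate:
    """Estimate an event rate from the times of the last K events."""

    def __init__(self, k: int):
        # Keep the newest event plus the K events before it.
        self.times = deque(maxlen=k + 1)

    def record(self, t: float) -> None:
        """Record the time of a new request or update."""
        self.times.append(t)

    def rate(self) -> float:
        """Events per unit time over the window; 0.0 until two events are seen."""
        if len(self.times) < 2:
            return 0.0
        span = self.times[-1] - self.times[0]
        return (len(self.times) - 1) / span if span > 0 else 0.0


def smooth(old: float, new: float, alpha: float = 0.25) -> float:
    """Blend a new cost measurement with the previous estimate (alpha is assumed)."""
    return alpha * new + (1 - alpha) * old
```

One estimator instance per object version would track read requests and another would track updates, so that the read and write rates adapt as the workload changes.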

3 Simulation Model 

In the simulation, to generate the workload of clients' requests, we model a
single server that maintains a collection of $m$ multimedia objects¹. The object
popularity followed a Zipf-like distribution [2]. Specifically, the popularity of the
$i$th video was proportional to $1/i^{\alpha}$. The default values of $m$ and
$\alpha$ were set to be 1000 and 0.75, respectively. The sizes of the videos followed
a heavy-tailed distribution with a mean value of 12K bytes [13]. The clients are
divided into five classes. Without loss of generality, we assume the sizes of the
five versions of each video to be 100 percent, 80 percent, 60 percent, 40 percent,
and 20 percent of the original video size. The access probabilities of the clients
are described by the vector $\langle 0.2, 0.15, 0.3, 0.2, 0.15 \rangle$. The
transcoding relationship of the five versions is shown in Figure 1.
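
A request generator for such a workload might look like the Python sketch below. It assumes that a client of class k requests version k of the chosen video; that mapping, and the names `zipf_weights` and `sample_request`, are assumptions made for this example.

```python
import random
from typing import List, Tuple


def zipf_weights(m: int, alpha: float) -> List[float]:
    """Normalized Zipf-like popularity: rank i has weight proportional to 1 / i**alpha."""
    raw = [1.0 / (i ** alpha) for i in range(1, m + 1)]
    total = sum(raw)
    return [w / total for w in raw]


# Relative sizes of the five versions and the client class access vector.
VERSION_SIZE_FRACTION = [1.0, 0.8, 0.6, 0.4, 0.2]
CLASS_PROBABILITY = [0.2, 0.15, 0.3, 0.2, 0.15]


def sample_request(rng: random.Random, weights: List[float]) -> Tuple[int, int]:
    """Return (video rank starting at 1, version index starting at 0)."""
    video = rng.choices(range(1, len(weights) + 1), weights=weights)[0]
    # Assumption: the client class index selects the requested version.
    version = rng.choices(range(len(CLASS_PROBABILITY)), weights=CLASS_PROBABILITY)[0]
    return video, version
```

Calling `zipf_weights(1000, 0.75)` once and reusing the result for every request keeps the generator cheap, since the normalization is the only step that touches all m objects.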



Fig. 1. Transcoding Graph for Simulation



Regarding the transcoding rate, we set it to be 20K bytes per second. The
delays of fetching the videos from the server are given by an exponential
distribution. We assume that there is no correlation between the video size and the
delay of fetching it from the server. This is justified by Shim et al. in [15].

¹ In the simulation, the multimedia objects are assumed to be videos.

The synthetic workloads are generated according to the recent results on the 
web workload characterization [6,8,13]. Table 1 lists the parameters and their 
values used in the simulation. 



Table 1. Parameters Used in Our Simulation

Parameter                        Value
Number of Nodes                  10000
Delay of Fetching Objects        Exponential distribution, $p(x) = \theta^{-1} e^{-x/\theta}$ ($\theta = 0.45$ sec)
Number of Multimedia Objects     1000 objects
Web Object Size Distribution     Pareto distribution, $p(x) = a b^{a} / x^{a+1}$ ($a = 1.1$, $b = 8596$)
Web Object Access Frequency      Zipf-like distribution, $1/i^{\alpha}$ ($\alpha = 0.7$)
Average Request Rate Per Node    $U(1, 9)$ requests per second
Transcoding Rate                 20 KB/sec
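
The size and delay distributions in Table 1 can be sampled with Python's standard `random` module, as in the sketch below. The function names are assumptions, the Pareto shape a = 1.1 is taken as read from Table 1, and 20 KB/sec is interpreted as 20,000 bytes per second, which is also an assumption.

```python
import random


def sample_size(rng: random.Random, a: float = 1.1, b: float = 8596.0) -> float:
    """Pareto-distributed object size in bytes with shape a and minimum b."""
    # random.paretovariate(a) has minimum 1, so scale by b.
    return b * rng.paretovariate(a)


def sample_delay(rng: random.Random, theta: float = 0.45) -> float:
    """Exponentially distributed fetch delay in seconds with mean theta."""
    return rng.expovariate(1.0 / theta)


def transcode_time(size_bytes: float, rate: float = 20_000.0) -> float:
    """Seconds needed to transcode an object at `rate` bytes per second."""
    return size_bytes / rate
```

Passing an explicit `random.Random` instance, rather than using the module-level functions, makes a simulation run repeatable from a fixed seed.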



We compare our scheme with the following algorithms. (1) Least Recently
Used (LRU) evicts the web object which was requested the least recently. The
requested object is stored at each node through which the object passes. The
cache purges one or more least recently requested objects to accommodate the
new object if there is not enough room for it. (2) Least Normalized Cost
Replacement (LNC-R) [14] is an algorithm that approximates the optimal cache
replacement algorithm. (3) Aggregate Effect (AE) [5] is an algorithm that
explores the aggregate effect of caching multiple versions of an object in the cache.
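
For reference, the LRU baseline can be sketched in a few lines of Python. This is a generic size-aware LRU cache written for illustration; the class name `LRUCache` and its methods are assumptions and do not come from the simulator used in this paper.

```python
from collections import OrderedDict


class LRUCache:
    """Size-aware LRU cache: evicts the least recently requested objects first."""

    def __init__(self, capacity: float):
        self.capacity = capacity
        self.used = 0.0
        self.items = OrderedDict()  # key -> size, least recently used first

    def get(self, key) -> bool:
        """Return True on a hit and mark the object as most recently used."""
        if key not in self.items:
            return False
        self.items.move_to_end(key)
        return True

    def put(self, key, size: float) -> None:
        """Insert an object, purging least recently used objects to make room."""
        if size > self.capacity:
            return  # the object can never fit, so it is not cached
        if key in self.items:
            self.used -= self.items.pop(key)
        while self.used + size > self.capacity:
            _, old_size = self.items.popitem(last=False)
            self.used -= old_size
        self.items[key] = size
        self.used += size
```

Unlike the cost-based schemes, this policy ignores transcoding relationships, object fetch cost, and update rates, which is why it serves as the simplest point of comparison.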

4 Performance Evaluation 

The primary cache performance metric employed in the simulation is the
delay-saving ratio (DSR), which is defined as the fraction of communication and
server delays that is saved by satisfying the references from the cache instead of
the server. We also use average access latency (AST) and object hit ratio (OHR)
as secondary performance metrics. Here OHR is defined as the ratio of the
number of requests satisfied by the caches as a whole to the total number of
requests. We use the staleness ratio (SR) as the primary consistency metric. The
staleness ratio is defined as the fraction of cache hits that return stale objects.
Here "stale" means that the time at which an object was brought into the cache is
earlier than the last-modified timestamp corresponding to the request. In the
following figures, LRU, LNC-R, and AE denote the results for the three algorithms,
and MA denotes the results for the cache maintenance algorithm proposed in
Section 2.
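
To make the metric definitions concrete, the following Python sketch computes them from a list of per-request records. The record layout (`Request`) and the function name `metrics` are assumptions made for this example; in particular, the saved delay of a hit is taken to be the server delay minus the delay actually experienced.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Request:
    hit: bool            # served from the cache
    stale: bool          # a hit that returned an out-of-date object
    server_delay: float  # delay the request would incur if served by the server
    actual_delay: float  # delay actually experienced by the client


def metrics(requests: Iterable[Request]) -> Dict[str, float]:
    """Compute delay-saving ratio, object hit ratio, staleness ratio, and latency."""
    reqs = list(requests)
    hits = [r for r in reqs if r.hit]
    total_server = sum(r.server_delay for r in reqs)
    saved = sum(r.server_delay - r.actual_delay for r in hits)
    return {
        "DSR": saved / total_server if total_server else 0.0,
        "OHR": len(hits) / len(reqs) if reqs else 0.0,
        "SR": sum(r.stale for r in hits) / len(hits) if hits else 0.0,
        "latency": sum(r.actual_delay for r in reqs) / len(reqs) if reqs else 0.0,
    }
```

Note that the staleness ratio is taken over cache hits only, so a policy with few hits can show a low absolute number of stale responses and still have a high SR.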

In the experiments, we compare the performance of different models across a
wide range of cache sizes, from 0.04 percent to 15.0 percent. The first experiment
investigates DSR as a function of the relative cache size at each node, and Figure
2(A) shows the simulation results. MA gives on average a 13.3% improvement over
LRU and 5.3% over LNC-R. The maximal improvement over LRU and LNC-R is 17.2%
and 8.2%, for cache sizes 0.5% and 2.0%, respectively. On average, the DSR of MA
is only 1.0% below that of AE. In the worst case, the DSR of MA is only 1.38%
below that of AE, for cache size 10%.

Figure 2(B) shows the results of OHR as a function of the relative cache size
for different models. Although MA is not designed to maximize the object hit
ratio, it still provides an improvement over LRU and LNC-R. In particular,
the average improvement is 29.6% over LRU and 22.5% over LNC-R. The object
hit ratio of MA is even closer to that of AE: on average 1.1%, and never more
than 2.39%, below the object hit ratio of AE.

In addition to improving the performance of the cache, the MA algorithm also
significantly improves its consistency. On average, MA achieves a staleness ratio
that is better than that of AE by a factor of 3.2; in the worst case, it improves
the SR of AE by a factor of 1.9, when the cache size is 0.5%. MA also improves SR
over LRU and LNC-R. On average, MA achieves a staleness ratio that is
50.8% better than that of LRU and 50.1% better than that of LNC-R. In the
worst case, it improves the SR of LRU by 10.2% when the cache size is 0.5% and
improves the SR of LNC-R by 8% when the cache size is 2.0%. The staleness ratio
comparison of the four algorithms can be found in Figure 2(C).




Fig. 2. Experiments for DSR, OHR, and SR



5 Conclusions 

In this paper, we proposed a maintenance algorithm for transcoding proxy
caching, which combines both cache replacement and cache consistency. The
simulation indicated that our algorithm can significantly improve the staleness
ratio while keeping the loss in cache performance within an acceptable range. This
greatly benefits cache design for transcoding proxy caching.
Abstract. The rapid expansion of the Internet is accompanied by a serious side
effect. Since there are too many information providers, it is very difficult to
obtain the contents that best fit customers' needs. Web Syndication Services
(WSS) are emerging as solutions to the information flooding problem. However,
despite its practical importance, WSS has not been much studied yet. In
this paper, we propose the Content Aggregation Middleware (CAM). It provides a
WSS with a content gathering substratum effective in gathering and
processing data from many different source sites. Using the CAM, a WSS provider
can build up a new service without being involved in the details of complicated
content aggregation procedures, and can thus concentrate on developing the service
logic. We describe the design, implementation, and performance of the CAM.



1 Introduction 

The Internet has been growing exponentially for the past decades, and has already
become the major source of information. The estimated number of Internet hosts
reached 72 million in February 2000, and is expected to reach 1 billion by 2008 [1].
However, such a rapid expansion is accompanied by a serious side effect. Although
users can easily access the Internet, it is very difficult to obtain the contents
that best fit their needs since there are too many information providers (Web
hosts). This problem is usually called information flooding.

Various Web Syndication Services (WSS) (see Figure 1) are emerging as solutions
to the information flooding problem. WSS is a new kind of Internet service
which spans distributed Web sites. It provides value-added information by
processing (e.g., integrating, comparing, filtering, etc.) contents gathered from
other Web sites. Price comparison services such as Shopping.com [2] and travel
consolidator services like Expedia.com [3] can be considered examples of WSSs.
To ordinary clients who are not familiar with a specific domain, a WSS targeting
the domain would be of great help in overcoming the information flooding.

Providing a WSS is technically challenging. It is much more complicated than
providing an ordinary service. However, despite its practical importance, WSS has
not been much studied yet. A system providing a WSS can be divided into two parts:
the WSS service logic and the content aggregation subsystem. The content
aggregation subsystem is the common core of many WSSs, while the service logic is
service specific and differs from service to service. The system receives requests
from clients and interacts with source sites to process the requests. In this
paper, we propose the Content Aggregation Middleware (CAM). The CAM is an efficient
content aggregation system designed to be a base for many WSSs. Using the CAM, a
service provider can easily develop and deploy a high performance WSS system
supporting a large number of clients and source sites.
We identify several requirements for a WSS site. First, a WSS site should support
a high level of performance. The performance requirement of a WSS site is much
higher than that of ordinary Web sites. It should manage a much larger number of
requests from clients spread over the Internet. Additionally, it should handle a
huge number of source sites and interactions with them. Second, it should support
the highly dynamic nature of the Internet environment. In a fully
Internet-connected environment, real world events can be quickly reflected and
propagated to systems. Once generated, the information will go through frequent
changes. Third, a WSS site should deal with many source sites, which are highly
heterogeneous.

The CAM has been designed to meet the above requirements of a WSS site. It
provides a WSS with a content gathering substratum effective in gathering and
processing data from many different source sites. Using the CAM, a WSS provider
can build up a new service without being involved in the details of complicated
content aggregation procedures, and can thus concentrate on developing the service
logic. The CAM simplifies the complex procedure of interacting with content
providers through a formalized service contract (SC). Also, it effectively masks
the high level of heterogeneity among different source sites. In addition, it is a
high performance system that greatly relieves the burden of performance concerns in
system development. Below, we describe the novel characteristics of the proposed
content aggregation system.

First, the CAM is a source data caching system along with basic data processing 
capabilities. It caches data in the form of source data, e.g., the unit of database fields 
as stored in content providers' databases. For value-added service, fine-grained con- 
trol on the cached contents gathered is required. Source data caching makes such 
fine-grained control possible. For data processing, basic functions such as content 
conversion, filtering, and query processing, are provided. 

Second, it is a high performance system. As mentioned, a WSS should handle a
high rate of requests from many clients. In addition, it should be capable of
managing a large number of interactions with source sites to keep the cached data
fresh. With source data caching, keeping cached data up-to-date can be done
efficiently. Also, to manage a large volume of data efficiently, it uses main
memory as its primary storage.

Third, the CAM is equipped with real-time update capability. To keep the fresh- 
ness of cached contents, any modification on the data at source sites is propagated to 
the CAM as soon as possible. The update mechanism is based on server invalidation 
scheme; upon modification, the source site initiates invalidation and modification of 
the cached data in the CAM. In this way, the delay to the data update can be short- 
ened. 

Fourth, a wrapper is used to deal with the heterogeneity of the source sites. The
CAM gathers contents from many different source sites, so handling the different
sites in a uniform way is critical to the CAM system. By deploying a wrapper module
to each source server, the CAM can handle different source sites in a uniform way.

In this paper, we present the design and implementation of the CAM. We also
show some measurement results to demonstrate the performance of the system. The
current version of our system is designed for WSSs interacting with typical Web
sites. Thus, Web sites adopting new technologies such as XML and Web Services are
not considered in this paper. We believe that such emerging Web technologies can be
easily incorporated into our system.



Fig. 1. An Example of Web Syndication Services - Travel Consolidator Service



This paper is organized as follows. The CAM architecture is described in Section 
2. A few challenging issues are discussed in Section 3. In Section 4, system perform- 
ance is discussed. We discuss related work in Section 5. Finally, we conclude our 
work in Section 6. 



2 Content Aggregation Middleware (CAM) Architecture 

A WSS can be constructed with a front-end WSS logic and a back-end CAM (See 
Fig. 2). The WSS logic implements service specific application logic. It is usually 
implemented as Web applications using JSP, Servlet, etc. It interacts with clients via 
Web server or application server to receive requests and deliver results. It also inter- 
acts with the CAM to request or to receive data required to construct result pages. 

The CAM has a modular structure, which consists of four components: Content 
Provider Wrapper (CPW), Content Provider Manager (CPM), Memory Cache Man- 
ager (MCM), and Memory Cache (MC). CPW runs on content provider sites and 
enables the CAM to access different content providers in an identical way. The other 
components are on the WSS site. CPM communicates with content providers and 
receives contents. MCM manages MC, which stores and manages the retrieved data. 
2.1 Processing Flow

The CAM mainly deals with two kinds of requests: content update requests and
content access requests. The request processing flows are shown in Fig. 2.

Fig. 2. CAM Architecture



The update process is initiated when data is modified in the content provider's
database. The content provider detects the update event and notifies CPW of it,
including the table ID, field names, and modified values. CPW receives the
notification message. Then, it converts the data according to the conversion
information and sends the converted result to CPM. CPM receives the update message
and forwards it to MCM. Finally, MCM replaces the data in the MC with the new data
in the update message.

The content access process is initiated when clients request a service. Web
applications implementing the WSS logic retrieve data from MC and generate a
service result. The Web applications access MC via a popular database interface
such as JDBC or ODBC.



2.2 Deployment of WSS and the CAM 

In order to start a WSS with the CAM, content providers as well as the WSS provider
need to participate in the service deployment process. First, both the WSS provider
and the participating content providers should agree on how to interact with each
other. Second, content providers need to install and configure a CPW. We use a
Service Contract (SC) to simplify the configuration process. The SC represents a
collection of well-defined and externally visible rules which both humans and
machines can understand [4]. It is used as an enforcement mechanism for proper
interactions between the CAM and content providers. The structure of the SC is
shown in Fig. 3.

Fig. 3. Structure of the Service Contract

After the SC is filled in and a CPW is installed on a content provider server, the
content provider and the CAM configure their systems according to the SC. Since the
SC contains the specification of all the interaction rules, the configuration is
done simply by feeding the SC into the systems. At the content provider site, the
CPW first parses the SC and then sets up related components such as the
communication interfaces. The CAM also parses the SC. Then, it notifies CPM's
monitoring module and update-listening module of the new content provider. If
needed, it also forwards configuration information such as valid actions,
protocols, and addresses to each module. Based on this information, the modules
prepare themselves for the new content provider. Note that, in the proposed
architecture, re-configuration is easily done dynamically by feeding a new SC into
the CAM and the contracted content provider.



3 Design Challenges 

3.1 Instant Update Mechanism 

It is important to keep the data in the CAM up-to-date. Thus, any modifications of 
contents in the content provider’s database should be promptly reflected to those in 
the CAM. In addition, the update mechanism should be efficient since a high number 
of update requests are expected. 

The update scheme is based on server-push. The content provider server instantly 
identifies any modification in the database, and initiates an update in the CAM by 
sending out an invalidation message. Thus, an update is propagated to the CAM with 
very small delay. When sending an invalidation message, we piggyback the message 
with the modified field and value. Thus, an update can be completed with one mes- 
sage. 
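
The piggybacking idea can be sketched in Python as follows. The message layout and the names `UpdateMessage` and `MemoryCache` are hypothetical and chosen only for illustration; in particular, the `row_key` field is an assumption about how a cached field is identified.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class UpdateMessage:
    table: str    # table ID at the content provider
    row_key: str  # key of the modified row (assumed field)
    field: str    # name of the modified field
    value: Any    # new value, piggybacked on the invalidation


class MemoryCache:
    """Field-level cache keyed by (table, row_key, field)."""

    def __init__(self) -> None:
        self.data: Dict[Tuple[str, str, str], Any] = {}

    def apply(self, msg: UpdateMessage) -> None:
        """Invalidate and refresh in one step, with no second round trip."""
        self.data[(msg.table, msg.row_key, msg.field)] = msg.value
```

Because the new value travels with the invalidation, the cache never holds an entry that is known to be invalid but not yet refreshed.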

The instant identification of content modification is based on a trigger
mechanism in the content provider's database. Using a trigger mechanism, the update
process can be done very efficiently, because fine-grained invalidation is
possible: changes can be detected in the unit of a field. Trigger mechanisms are
provided in many popular DBMSs such as Oracle, DB2, and MySQL.




Fig. 4. Instant Update Mechanism and Its Procedure 



Currently, time-to-live (TTL) based schemes are most popularly used as a cache
consistency mechanism in the Internet [8]. However, TTL-based schemes are not
suitable for the CAM since they cannot quickly propagate updates to a cache. Prompt
propagation of updates may be achieved if a cache polls servers for changes at very
small intervals. However, this will incur excessive overhead on the cache. In
contrast, server-push style approaches can reflect changes in the original data
more quickly.

Fig. 4 shows the detailed structure of CPW and the whole update process, from
database modification at the content provider server to the actual update at the
CAM. When an update occurs at the content provider's database (1), the trigger
routine activates a trigger, which we name the Event Reporter (2). The Event
Reporter sends the modified information to CPW (3). The update information is
received by the Event Listener module in CPW. Then, the Content Converter converts
the schema and format of the received information (4), referring to the Content
Conversion Table if needed (5). The Content Converter makes an update message with
the converted information (6). Then, the Communication Module sends the message to
the CAM (7). As soon as CPM receives the update message, it forwards the message to
MCM (8). Lastly, MCM constructs a proper query message based on the received message
and commits the update transaction on MC, logging this event if required (9).
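
The conversion steps inside CPW can be sketched in Python as follows. The shape of the Content Conversion Table, the event dictionary, and the names `convert` and `handle_update` are hypothetical; the text does not specify these data structures, so this is only one plausible arrangement.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical Content Conversion Table:
# (source table, source field) -> (target field, value converter)
ConversionTable = Dict[Tuple[str, str], Tuple[str, Callable[[Any], Any]]]


def convert(event: Dict[str, Any], table: ConversionTable) -> Dict[str, Any]:
    """Steps (4)-(6): map a provider-side event to the CAM's schema and format."""
    key = (event["table"], event["field"])
    if key in table:
        target_field, fn = table[key]
        return {"table": event["table"], "field": target_field,
                "value": fn(event["value"])}
    return dict(event)  # no conversion entry, so pass the event through


def handle_update(event: Dict[str, Any], table: ConversionTable,
                  send: Callable[[Dict[str, Any]], None]) -> None:
    """Steps (3)-(7): receive an event, convert it, and send it to the CAM."""
    send(convert(event, table))
```

Keeping the conversion rules in a table rather than in code matches the template-based wrapper design: the same wrapper logic can serve every content provider, with only the table differing.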
3.2 Template-Based, Safe Wrapper Mechanism

In the design of the CPW, safety is a major concern since the CPW runs on foreign 
(i.e., content providers') servers. It may access confidential data, crash, or generate 
an error that disturbs the server system. 

For safety, we propose a dynamically customizable wrapper. In our approach, a 
generic wrapper template is composed and used for every content provider. Since 
there is only one template, certifying the safety of the wrapper becomes easy. For 
instance, the safety can be certified by a third-party agency. Once certified, the safety 
of the wrapper is assured for every content provider. Each wrapper instance is gener- 
ated from the template along with an SC. The instance will act as specified in the SC. 
Note that the SC is signed by the WSS provider as well as the content provider. 

We implement the wrapper in Java. A Java-based implementation is advantageous 
in several ways. First, the module can be installed in any computing environment 
running a Java Virtual Machine (JVM). Second, faults in a wrapper module do not 
affect the reliability of a content provider system. Faults in the wrapper module 
propagate only to the virtual machine. Third, by using the powerful access control 
mechanism of Java, content providers can prevent a wrapper from accessing their 
resources. 



4 Performance Evaluation 

The performance of the CAM is evaluated using a prototype. For high performance, 
the CAM prototype was implemented mostly in C++ on the Linux platform. Most of the 
MCM's functions, including system monitoring and logging, have been implemented. 
In the current prototype, the number of requests for each content object and the 
number of messages from each content provider are monitored and logged. The current 
version of the MC has been implemented by customizing a third-party main memory 
database, Altibase [9]. This helped us implement the prototype quickly. To help 
content providers set up database triggers, we plan to provide templates and samples 
of triggers for different DBMSs. For the time being, those for the Oracle DBMS are 
provided. 



4.1 Experimental Environment 

We assume that the CAM is deployed on a single node. The performance will increase 
when the CAM is deployed on multiple nodes. For simplicity of measurement, clients 
and content providers are connected to the CAM via 100 Mbps local area networks. 
Each node has a 1 GHz Pentium III CPU and 512 MB of main memory, except the node 
for the CAM, which has 2 GB of main memory. Red Hat Linux 7.2 is used as the 
operating system, and Sun Java 1.3 is used as the JVM. Apache 1.3.20 and Tomcat 
3.2.3 are used as the Web server and the application server, respectively. In the 
rest of this section, we assume that all the cached contents fit in the main memory. 



4.2 Workload and Measures 

The performance of the CAM prototype is evaluated via three different measures: (1) 
browse, (2) update, (3) mixed throughputs. The browse throughput is measured when 
requests are only from clients, while the update throughput is measured when re- 
quests are only from content providers. The mixed throughput is measured when the 
two types of requests are issued. 

To measure the performance of browse request processing, we use a transactional 
Web benchmark: TPC Benchmark™ W (TPC-W) [13]. It is commonly used to measure 
the performance of a database-backed web serving system. We slightly modified 
the TPC-W benchmark. Originally, there are two kinds of interactions in the TPC-W 
specification: browsing and ordering. We use only browse interactions in our 
experiment since the browsing interaction is composed of database retrieval 
operations. Note that the database scale factor is used to specify the scale of the 
measured web serving system. To measure the update throughput, we built our own 
utility called the update request generator. It generates and sends multiple update 
requests simultaneously, emulating the situation where several content providers 
update their contents at the same time. 



4.3 Performance Evaluation 

We measure the throughput and response time when the database scale is 10k and 100k. 
Fig. 5 shows the throughput of five browsing interactions. Throughput is represented 
in WIPS, the number of interactions processed per second. The total WIPS, i.e., the 
sum of the WIPS for the five interactions, is 77 and 8 when the database scale is 10k 
and 100k, respectively. The response times measured in the same experiments show 
that all requests are processed within 0.23 and 1 second when the database scale is 
10k and 100k, respectively. 

Fig. 6 (a) shows the throughput as the number of threads in the update request 
generator increases from one to ten; the number of threads represents the number of 
content providers sending updates simultaneously. We ran the experiments with 
update message sizes of 64 and 256 bytes. Although the update size could be arbitrary, 
we assume that the sizes of frequently changed database fields are not large. The 
number 64 is chosen since it is the smallest power of two larger than 38, which is the 
maximum number of digits of a numeric variable in an Oracle database. Similarly, 256 
is the power of two closest to 255, which is the default size of the char type in 
Oracle. The figure shows that the CAM processes about 400 requests per second. The 
number of active content providers and the update message size have a negligible 
effect on the performance. 

To measure the mixed throughput, we kept sending a fixed number of update requests 
per second via the update request generator, and then measured the browse 
throughput via TPC-W. Fig. 6 (b) shows the results. For simplicity, the throughput is 
represented as the total WIPS. Note that from the previous experiments, the 
browse-only throughput, i.e., browse throughput without any update, is 77 and the 
update-only throughput is 411. 

Fig. 6. (a) Update Throughput, (b) Mixed Throughput 



5 Related Work 

Content aggregation tools such as Agentware [10], Active Data Exchange [11], and 
Enterprise Content Management Suite [12] help to retrieve and aggregate contents 
from multiple information sources. However, those tools are for intra-organizational 
use, while the CAM is designed for inter-organizational use. 

Recently, a number of studies have proposed techniques for dynamic data caching 
[5, 6, 7]. These techniques have been proposed mainly as a scalability solution 
for ordinary Web services, noting that the generation of dynamic data becomes a 
major bottleneck. The CAM is different in that it focuses on the provision of a WSS, 
which is a new type of cross-organizational data service, based on the cached 
information. The CAM also differs in that other caches can be considered reverse 
proxies used within the contexts of specific servers, whereas the CAM is closer to a 
proxy that operates along with a number of content providers. 



6 Conclusion 

A WSS system is composed of a WSS service logic and a content aggregation subsystem. 
Content aggregation is the common core of many WSSs, while the service logic is 
service specific and differs from one service to another. We proposed a high 
performance content aggregation middleware called the CAM. The CAM provides a WSS 
with a content gathering substratum that is effective in gathering and processing 
data from many different source sites. Using the CAM, a WSS provider can build 
a new service without being involved in the details of complicated content 
aggregation procedures, and can thus concentrate on developing the service logic. 

The CAM is a source data caching system that makes fine-grained control of gathered 
contents possible. It is a high performance system capable of handling a high rate 
of requests from many clients and content providers. It uses main memory as its 
primary storage to efficiently manage a large volume of data. The CAM is equipped 
with a real-time update capability to keep cached contents fresh, and with a wrapper 
to deal with the heterogeneity of the source sites. 

In this paper, we described the design and implementation of the CAM. We also 
showed the performance of the CAM prototype. We currently plan to further improve 
the performance of the system. 



Abstract. Location tracking for mobile agents is designed to deliver a message 
to a moving object in a network. Most tracking methods exploit relay stations 
that hold location information to forward messages to a target mobile agent. 
In this paper, we propose an efficient location tracking method for mobile 
agents using the domain-based proxy as a relay station. The proxy in each 
domain is dynamically determined when a mobile agent enters a new domain. 
The proposed method exploits the domain-based moving patterns of mobile 
agents and minimizes registration and message transfer costs in mobile 
agent systems. 



1 Introduction 

Mobile agents are software objects that can migrate across the network, 
representing users in various tasks. The most attractive applications are e-commerce, 
network management, and real-time control in many distributed system areas. 
Code mobility provides many advantages. When the data volume on a remote host 
is very large, mobile agent systems can save a great deal of network bandwidth. 
Instead of requesting the whole data set over a network connection, a mobile 
agent migrates to the target host, filters the data locally, and brings back only 
the result. For real-time control of remote devices, the traditional client/server 
design is not a good candidate due to irregular network delay. However, a mobile 
agent that has migrated to a remote system can directly control the target system 
in real time. Mobile agents are also useful for applications in wireless 
environments, such as laptops or PDAs, that can be disconnected at short 
notice [1,2,3]. 

Apart from these advantages, there are many problems to be solved. Most 
of the research focuses on providing system support for the security of mobile 
agents, reliable communication with fast-moving mobile agents, and efficient 
location management [4,5]. The typical application of a mobile agent is to bypass 
the communication link and to exploit local access to resources on a remote 
server. Thus one may argue that the communication issue is not important. 
However, several situations require efficient communication with mobile agents. 
For example, a user may launch a mobile agent with some parameters directing 
the behavior of the agent and may want to change the parameters later due to 
changes in the context that determined their creation [4]. Mobile agent systems 
should have location tracking functions to transfer messages only to target 
agents. Whenever a mobile agent migrates to a new node, the new location 
information should be registered somewhere in the system. 

In this paper, we propose an efficient location tracking method called Domain- 
Based Proxy. A domain consists of a group of hosts that are close to each other, 
measured by the number of hops in a network. Mobile agents can reduce the 
length of their migration paths by visiting the hosts in the same domain first, 
rather than selecting hosts randomly. The proposed method exploits the domain- 
based moving patterns of mobile agents and minimizes the registration and mes- 
sage delivery cost. We do not consider the chasing problem that occurs when 
mobile agents migrate so frequently that relay stations keep forwarding messages 
to hosts where the target agent no longer stays [4]. 

Section 2 explains background information in the field and motivation for this 
work, and section 3 explains the idea of domain-based proxy and its effectiveness 
in reducing the registration and message delivery costs. In section 4, we discuss 
simulation results with various parameters. Finally, we present our conclusions 
in section 5. 



2 Background and Motivation 

In recent years, there have been several protocols on location tracking of mobile 
agents. The common ground of these protocols is to have relay nodes that hold 
the current location of an agent and forward messages to it [5]. There are three 
different forms of relay nodes: a relay node that is fixed, a relay node that is 
movable, and a chain of relay nodes that are linked with a pointer. The relay 
nodes provide location transparent service to senders so that senders do not 
care about the current locations of agents and their movements. We assume 
that senders know the homes of mobile agents and the home nodes also act as 
relay nodes. 



2.1 Home 

The home node of a mobile agent carries the current location information of the 
agent and forwards messages from senders to the destination agent. Whenever 
a mobile agent migrates to a new host, it registers its current location with the 
home node. The protocol is simple, but the registration cost is high when the 
agent is far away from the home. Since there is no other relay node between the 
home and the destination node, the message delivery cost is low. 

2.2 Pointer Chain 

Each node on the migration path of a mobile agent keeps the pointer to the 
next node on the path. The home node becomes the first node in the pointer 
chain. When a mobile agent migrates between the nodes within a domain that 




Domain-Based Proxy for Efficient Location Tracking of Mobile Agents 207 



is far away from the home, the registration cost is low compared to the Home 
method. However, the message delivery cost becomes very high, since messages 
are forwarded through all the nodes on the chain. 
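
The contrast between the Home and Pointer Chain schemes can be sketched with a small cost model. This is a minimal illustration under stated assumptions: hosts are placed at hypothetical (x, y) positions, Manhattan distance stands in for hop count, and one unit of cost is charged per unit of distance. None of these choices come from the protocols themselves.

```python
from typing import List, Tuple

Node = Tuple[int, int]


def dist(a: Node, b: Node) -> int:
    """Stand-in for hop count: Manhattan distance (an assumption)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def home_costs(path: List[Node]) -> Tuple[int, int]:
    """(total registration cost, cost of delivering one message) for Home.

    path[0] is the home node; every migration registers directly with it,
    and a message goes from the home straight to the current node.
    """
    home = path[0]
    registration = sum(dist(node, home) for node in path[1:])
    delivery = dist(home, path[-1])
    return registration, delivery


def pointer_chain_costs(path: List[Node]) -> Tuple[int, int]:
    """(total registration cost, cost of delivering one message) for Pointer Chain.

    Each migration only updates the pointer on the previous node, and a
    message follows every pointer from the home to the current node.
    """
    hops = [dist(a, b) for a, b in zip(path, path[1:])]
    return sum(hops), sum(hops)


if __name__ == "__main__":
    # Hypothetical path: home at (0, 0), then three hosts in a far-away domain.
    path = [(0, 0), (10, 10), (12, 10), (10, 11)]
    print("Home:", home_costs(path))
    print("Pointer Chain:", pointer_chain_costs(path))
```

For this assumed path, Home pays much more for registration while Pointer Chain pays more per delivered message, which is the trade-off the two subsections describe.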



2.3 Mailbox 

Each mobile agent has a mailbox that relays messages to it. The agent regis- 
ters its current location with its mailbox whenever it moves to a new node. The 
mailbox decoupled from the agent can reside in different hosts and moves in- 
dependently [3]. Since it is movable, the home node should have the updated 
location information for the mailbox. After getting the current mailbox location 
from the home node, the sender delivers messages to the agent's mailbox, which relays 
the messages to the target agent. The method of mailbox movement has not 
been published yet. Without an efficient mailbox movement, the performance 
will be nearly the same as the Home method. 



3 Domain-Based Proxy 

We define the term domain as a group of hosts that are close to each other in 
network structure. Each host belongs to only one domain. A proxy is determined 
at the time of entry to a new domain. The first host that a mobile agent visits 
in a new domain serves as a proxy in the domain. Whenever the mobile agent 
moves to a host in the same domain, it registers its location with the proxy in 
the domain. If the mobile agent migrates to a host in another domain, the host 
becomes a proxy in the new domain and registers with the proxy in the previous 
domain. Since another mobile agent can enter the same domain by visiting a 
different host, there may exist several proxies in one domain. A proxy has a data 
structure that points to the proxy of the next domain to which the mobile agent 
has already migrated, or points to the host which the mobile agent is currently 
visiting. We can lower the registration cost if hosts in a domain are close to each 
other. Messages are forwarded through the proxy chain and the last proxy in the 
chain forwards them to the host in which the mobile agent stays. 
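
The registration rule described above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the class name `DomainProxyTracker`, the string host names, and the `domain_of` lookup are all hypothetical, and the model tracks a single agent only.

```python
from typing import Callable, List


class DomainProxyTracker:
    """Tracks the proxy chain of one mobile agent (illustrative only)."""

    def __init__(self, first_host: str, domain_of: Callable[[str], str]) -> None:
        self.domain_of = domain_of
        # The first host visited in a domain serves as that domain's proxy.
        self.proxies: List[str] = [first_host]
        self.current = first_host

    def migrate(self, host: str) -> str:
        """Move the agent to `host` and return the node that is notified."""
        last_proxy = self.proxies[-1]
        if self.domain_of(host) != self.domain_of(last_proxy):
            # New domain: the host becomes a proxy and registers with the
            # proxy of the previous domain, extending the chain.
            self.proxies.append(host)
        # Same domain: the new location is registered with the domain's proxy.
        # In both cases the node that receives the registration is last_proxy.
        self.current = host
        return last_proxy

    def message_route(self) -> List[str]:
        """Nodes a message passes through: the proxy chain, then the agent's host."""
        route = list(self.proxies)
        if self.current != route[-1]:
            route.append(self.current)
        return route


if __name__ == "__main__":
    # Hypothetical layout: ten hosts in three domains.
    domains = {"h1": "A", "h2": "A", "h3": "A", "h4": "A",
               "h5": "B", "h6": "B", "h7": "B",
               "h8": "C", "h9": "C", "h10": "C"}
    tracker = DomainProxyTracker("h1", domains.__getitem__)
    for host in ["h2", "h3", "h4", "h5", "h6", "h7", "h8", "h9", "h10"]:
        tracker.migrate(host)
    print(tracker.message_route())
```

In this assumed layout the agent visits ten hosts but a message traverses only the three proxies and the current host, because migrations inside a domain never lengthen the chain.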

Figure 1 shows the proxy chain after the mobile agent migrates to h10. It 
followed the path P1, h2, h3, h4, P2, h6, h7, P3, h9, and h10. Proxies P1, P2, and 
P3 are the first hosts the mobile agent visited in their respective domains. The 
solid line indicates the message-forwarding path. The message-forwarding path 
is shorter than that in the Pointer Chain method. Assuming that the inter-domain 
distances are relatively large, the registration costs within a domain are lower 
than those between domains. Thus we can reduce the registration cost for 
migrations within a domain. If the current host is in the same domain as the 
previous host, the current host registers with the proxy of this domain. If the 
domains are different, the current host becomes the proxy of the new domain. 
Consequently, the proxy of the new domain is linked to the proxy of the previous 
domain. 

Fig. 1. Migration in the Domain-Based Proxy scheme 



3.1 Compacting Proxy Chain 

Messages are forwarded through the proxy chain and delivered to the host where 
the target agent is. When the number of messages is high, the cost of a long proxy 
chain may dominate the overall cost, overshadowing the low registration cost. 
Given an estimate of the number of messages, we can compact the proxy 
chain to reduce the message delivery cost. 

We define the following parameters to describe compaction of the proxy chain. 

- $N$ : the expected number of messages to receive 

- $D_q$ : the distance from the home to the current node 

- $D_p$ : the distance from the home to the current node through the proxy chain 

- $R_q$ : the registration cost at the home node 

- $R_p$ : the registration cost at the proxy node 

After executing the mobile agent many times, we can predict the expected 
number of messages it will receive. We assume that each node knows its distance 
to all other nodes, including the home node. To evaluate the distance $D_p$ from 
the home to the current proxy through the proxy chain, each agent should carry 
$D_{p-1}$, which denotes the distance from the home to the previous proxy 
through the chain. The distances $D_q$ and $D_p$ determine the registration costs 
$R_q$ and $R_p$, respectively. With these parameters, we can estimate the 
registration and message delivery costs both with and without compaction of the 
proxy chain. Since migration within a domain does not change the proxy, only 
migration to a different domain requires the following decision on whether to 
compact or not. 



$$N \cdot D_p + R_p > N \cdot D_q + R_q$$ 



Fig. 2. Registration cost (x-axis: migrations per domain) 

As an extreme case, an agent may migrate along a path such that $D_p$ equals 
$D_q$. Since $R_p$ is definitely smaller than $R_q$ in this case, keeping the 
proxy chain reduces the overall cost. In most cases, the distance through the 
proxy chain, $D_p$, is larger than the direct distance, $D_q$, and the registration 
cost $R_p$ is smaller than $R_q$. When no message is expected to arrive, compaction 
is not necessary. However, as the expected number of messages increases, we can 
minimize the overall cost by compacting the proxy chain. 
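
The compaction decision can be sketched as a comparison of two estimated costs. The sketch assumes the decision rule reads as "compact when N·Dp + Rp exceeds N·Dq + Rq", i.e. each cost is the delivery cost of N messages plus one registration; this reading is an interpretation consistent with the cases discussed above, and the function and argument names are hypothetical.

```python
def should_compact(n_messages: float, d_proxy: float, d_home: float,
                   r_proxy: float, r_home: float) -> bool:
    """Return True when compacting the proxy chain is estimated to be cheaper.

    n_messages: expected number of messages (N)
    d_proxy:    distance home -> current node through the proxy chain (Dp)
    d_home:     direct distance home -> current node (Dq)
    r_proxy:    registration cost at the proxy node (Rp)
    r_home:     registration cost at the home node (Rq)
    """
    keep_chain_cost = n_messages * d_proxy + r_proxy
    compact_cost = n_messages * d_home + r_home
    return keep_chain_cost > compact_cost


def break_even_messages(d_proxy: float, d_home: float,
                        r_proxy: float, r_home: float) -> float:
    """Number of messages above which compaction pays off (requires Dp > Dq)."""
    if d_proxy <= d_home:
        raise ValueError("the chain is not longer than the direct path")
    return (r_home - r_proxy) / (d_proxy - d_home)


if __name__ == "__main__":
    # Hypothetical values: chain distance 30, direct distance 20,
    # proxy registration cost 5, home registration cost 20.
    for n in (0, 1, 2, 10):
        print(n, should_compact(n, 30, 20, 5, 20))
    print(break_even_messages(30, 20, 5, 20))
```

With these assumed values, compaction becomes worthwhile once more than 1.5 messages are expected; with no expected messages, or when the chain is no longer than the direct path, the chain is kept.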

4 Simulation and Discussion 

We assume that the network structure is in the form of a two dimensional grid 
and the location of each host is expressed by the coordinate (x, y) in the grid. 
The distance between two hosts can be calculated with the geometrical distance 
of two coordinates in the grid. The grid is partitioned to form domains. Each 
domain is also a square grid with smaller size. The costs for location registration 
and message delivery depend on the distance between two hosts and the data 
size. We assume the data size for registration is about a quarter of the data size 
in message delivery [5]. 
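
The simulated network model can be sketched with a few helper functions. The sketch makes several assumptions that the text does not spell out: "geometrical distance" is taken to be Euclidean distance, a cost is modeled as distance multiplied by data size, and the domain side length and size constants are hypothetical values.

```python
import math
from typing import Tuple

Host = Tuple[int, int]

MESSAGE_SIZE = 1.0                       # relative data size of a message (assumed unit)
REGISTRATION_SIZE = MESSAGE_SIZE / 4.0   # registration data is about a quarter of that


def distance(a: Host, b: Host) -> float:
    """Euclidean distance between two grid coordinates (an assumption)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def domain_of(host: Host, domain_side: int) -> Tuple[int, int]:
    """Index of the square domain containing `host` when the grid is tiled
    with domains of side length `domain_side`."""
    return host[0] // domain_side, host[1] // domain_side


def registration_cost(src: Host, dst: Host) -> float:
    """Cost of one registration, modeled as distance times data size."""
    return distance(src, dst) * REGISTRATION_SIZE


def message_cost(src: Host, dst: Host) -> float:
    """Cost of forwarding one message over one link of the relay path."""
    return distance(src, dst) * MESSAGE_SIZE


if __name__ == "__main__":
    home, host = (0, 0), (3, 4)
    print(distance(home, host), domain_of((7, 2), 4))
    print(registration_cost(home, host), message_cost(home, host))
```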

In the simulation, we calculate the registration and message delivery costs 
that occur at the hosts on the migration path. Starting from a randomly selected 
host in a domain, we continue to move to randomly selected hosts in the same 
domain until the number of migrations per domain is met. After completing the 
given number of migrations in a domain, we move to a new domain and visit 
hosts in that domain. Since the migration pattern can differ between 
applications, the domains are selected randomly for the simulation. We assume that 
all the hosts involved are lightly loaded and there is no additional delay in 
processing messages. We repeat the above simulation while varying the number of 
migrations per domain and the number of messages. 

Fig. 3. Message delivery cost (x-axis: migrations per domain) 

Figure 2 represents the registration costs against the number of migrations 
per domain. We observe that the Proxy and the Pointer Chain methods outperform 
the Home method as the number of migrations per domain becomes larger. The 
reason is that the inter-domain distances are greater than the distances between 
hosts in a domain. When the number of migrations per domain equals one, we 
do not expect any performance advantage either in the Proxy or the Pointer 
Chain method. Since the next domain to visit is selected randomly, the distance 
from the visiting domain to the home node will be comparable to the average 
inter-domain distance. 

Figure 3 shows the message delivery costs against the number of migrations 
per domain. Since the Pointer Chain method delivers messages through all the 
relay stations that are on the migration path, the message delivery cost increases 
linearly as the number of migrations per domain increases. However, the message 
delivery cost of the Proxy method remains constant because the proxy chain 
length does not increase even if the number of migrations per domain increases. 
The Home method delivers messages directly to the target agent without any 
relay station and the message delivery cost remains minimal. 

Figure 4 shows combined costs against the number of messages with the number 
of migrations per domain fixed at 13. As the number of messages increases, 
the message delivery costs in the Proxy and the Pointer Chain methods begin to 
outweigh the registration costs and dominate the combined costs. Since the 
pointer chain is longer than the proxy chain, the cost of the Pointer Chain 
method increases more steeply than that of the Proxy method. For the Home 
method, however, the registration cost still dominates the message delivery 
cost and the rate of increase is unnoticeable. The Compaction method achieves 
a low registration cost by exploiting the domain-based proxy and keeps the 
message delivery cost low by maintaining a short proxy chain. Random selection 
of domains works in favor of the Compaction method. In real-life situations, 
domains may be selected in a sorted order to reduce the migration path. If that 
is the case, we may need a more elaborate compaction method to calculate the 
minimum path. 

Fig. 4. Combined cost against the number of messages delivered (x-axis: number of messages) 

5 Conclusion 

In this paper, we propose an efficient location tracking method for mobile agents 
using domain-based proxy. The domain-based proxy method can minimize regis- 
tration and message delivery costs by exploiting migration patterns. The proxy 
is determined when a mobile agent migrates to a new domain. The first host 
that a mobile agent visits in a new domain serves as a proxy in the domain. 
In the simulation, we calculated the registration and message delivery costs by 
changing the parameters such as the number of migrations per domain and the 
number of messages. Assuming that the hosts in a domain are close to each 
other, we can minimize the registration cost by exploiting the proxy within the 
domain and minimize the message delivery cost by compacting the proxy chain. 
Since the domains were selected randomly in the simulation, the proposed simple 
compaction method is very effective in reducing the proxy chain length. In 
real-life situations, however, the domains may be selected in a sorted order to 
reduce the migration path. For a specific pattern of migration, we may need a 
more elaborate compaction method to minimize costs. 



Abstract. In this paper we present and evaluate Inhambu, a distributed 
object-oriented system that relies on dynamic monitoring to collect information about 
the availability of computational resources, providing the necessary support for 
the execution of data mining applications on clusters of PCs and workstations. 
We also describe a modified implementation of the data mining tool Weka, 
which executes the cross validation procedure in parallel with the support of 
Inhambu. We present preliminary tests, showing that performance gains can be 
obtained for computationally expensive data mining algorithms, even when 
running with small datasets. 



1 Introduction 

Knowledge discovery in databases is the non-trivial process of identifying valid, 
novel, potentially useful, and ultimately understandable patterns in data [1]. Data 
Mining (DM) is a step in this process that centers on the automated learning of new 
facts and relationships in data, which consists of three basic steps: data preparation, 
information discovery and analysis of the mining algorithm output. All of these three 
steps exploit huge amounts of data and are computationally expensive. In this sense, 
several techniques have been proposed to improve the performance of DM 
applications, such as parallel processing [2] and implementations based on clusters 
of workstations [3] and computational grids [9]. These techniques can leverage the 
deployment of DM applications to production scales. 

Building computer clusters for high performance computing has gained increasing 
acceptance during the last few years. According to the November 2003 list of 
the Top500 supercomputer sites [4], 208 were clusters, whereas only 93 clusters 
appeared 12 months before. Assembling computer clusters composed of commodity 
PCs or workstations is an easy way to provide computational power at low cost. 
The effective use of clusters necessarily depends on the availability of software 
tools capable of supporting the efficient usage of their resources. In networks of 
workstations or PCs, individual computers may present low levels of utilization of 
computational resources, so that idle resources can be used for the execution of 
processing- and memory-intensive applications. In this sense, scheduling policies 
should involve dynamic detection and allocation of idle resources. 

* This project is partially granted by CNPq (Conselho Nacional de Desenvolvimento Cientifico 
e Tecnologico, Brazil), under contract number 401439/2003-8. 

H. Jin et al. (Eds.): NPC 2004, LNCS 3222, pp. 213-220, 2004. 

© IFIP International Federation for Information Processing 2004 

H. Senger et al. 

In this paper, we present and evaluate Inhambu, a distributed object-oriented 
system that provides load monitoring and detection of idle resources in the computers 
that are part of the cluster. The adopted policies take into account real situations in 
which computers may be heterogeneous and the availability of resources may 
fluctuate due to the presence or absence of local users. Although Inhambu can be used 
to provide resource management to other applications, in the context of this project 
it is used for supporting the execution of the Weka System [5], which is open source 
software for data mining. 

The remainder of this paper is organized as follows: Section 2 introduces the 
architecture and policies implemented by Inhambu, whereas Section 3 outlines the 
changes made to Weka in order to run data mining tasks in a parallel fashion. Section 4 
goes on to present performance results, and Section 5 outlines some related work. 
Finally, Section 6 summarizes our results and outlines future work. 



2 Overview of Inhambu 

This section briefly describes the main components and strategies comprising our 
system, named Inhambu, whose main components are depicted in Figure 1. Through 
this architecture, Inhambu implements an extended trading service which can support 
resource management functionalities for the interaction among client/server programs 
written in Java. The trader enables server programs to publish their services, by 
invoking its exportService operation and passing service names and remote object 
references. Such information remains stored in the trader, which can be queried by 
client programs by means of its importService operation. Clients must import 
references to remote objects that implement the application services before they can 
invoke them. Clients can be either application programs or Weka's user interface. 




Fig. 1. The Trader Model, implemented by Inhambu. 





Inhambu: Data Mining Using Idle Cycles in Clusters of PCs 






During the execution of the import operation, a resource management policy that 
aims at minimizing the execution times is used. This policy uses information about 
idle resources (e.g., CPU, memory), which is periodically received from monitoring 
agents placed on every server computer. In summary, our policy looks for the best 
computer which implements the requested service. If such a server can be found, then 
an object reference is returned to the requesting client. If no good server currently 
implements the requested service, the trader looks for another good computer on 
which to instantiate a new object that implements the service (which could be the 
first instance for this service type, or a new replica of existing ones). In all cases, 
the server selection procedure looks for the computer capable of executing the 
service in the shortest time, taking into account its processing power and current 
load. A more detailed description of the system's architecture and operation was 
presented in [6]. 
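
The selection policy can be sketched as follows. This is only an illustration of the idea, not Inhambu's code (which is written in Java and described in [6]): the `Server` record, the `import_service` function, the "available power = power × (1 − load)" estimate, and the `min_available` threshold that defines a "good" server are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Server:
    name: str
    power: float   # relative processing power (hypothetical unit)
    load: float    # CPU fraction in use, 0.0..1.0, as reported by a monitoring agent
    services: Set[str] = field(default_factory=set)

    def available_power(self) -> float:
        """Estimated capacity left for new work (assumed model)."""
        return self.power * (1.0 - self.load)


def import_service(servers: List[Server], service: str,
                   min_available: float = 0.5) -> Tuple[Server, bool]:
    """Return (chosen server, whether a new service instance was created)."""
    # Prefer a "good" server that already implements the service.
    hosting = [s for s in servers
               if service in s.services and s.available_power() >= min_available]
    if hosting:
        return max(hosting, key=Server.available_power), False
    # Otherwise pick the best computer overall and instantiate the service there.
    best = max(servers, key=Server.available_power)
    created = service not in best.services
    best.services.add(service)
    return best, created


if __name__ == "__main__":
    cluster = [Server("a", 2.0, 0.9, {"cv"}),
               Server("b", 1.0, 0.1),
               Server("c", 3.0, 0.5, {"cv"})]
    chosen, created = import_service(cluster, "cv")
    print(chosen.name, created)
```

A faster but heavily loaded machine can lose to a slower idle one under this estimate, which mirrors the idea of weighing processing power against current load.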



3 Parallelizing Cross Validation with Inhambu 

Weka [5] is currently one of the most popular DM tools. It is open source software 
developed by researchers at The University of Waikato, New Zealand, and issued 
under the GNU General Public License (GPL). Weka is written in Java and is 
currently available for Windows, Mac OS, and Linux platforms. In a nutshell, Weka 
provides implementations of several algorithms for data mining. Its current version 
contains implementations of 71 algorithms for classification, 5 algorithms for 
clustering, 2 algorithms for association, and 12 algorithms for attribute selection. 
In fact, it is continually growing, incorporating more and more data mining 
algorithms. All these algorithms can be either applied directly to a dataset or called 
from Java code. 

Classification algorithms, also called classifiers, are predominant among the 
algorithms implemented in Weka. Therefore, we decided to first investigate the 
potential benefits that Inhambu could bring to the classifiers package. The standard 
way of predicting the error rate of a classifier given a simple, fixed sample of data 
is to use stratified tenfold cross validation [5]. In the tenfold cross validation 
process, the dataset is divided into ten equal parts (folds) and the classifier is 
trained on nine parts and tested on the remaining one. This procedure is repeated for 
ten different training sets and the estimated error rate is the average over the test 
sets. Clearly, these experiments can be performed in parallel, taking advantage of 
Inhambu. It is important to emphasize that all classifiers implemented in Weka can 
benefit from such an approach. 
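
The fold construction can be sketched as below. This is a simplified stand-in and not Weka's implementation: the helper names are hypothetical, instances are dealt to folds round-robin within each class to keep class proportions similar, and no shuffling is performed.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def stratified_folds(labels: Sequence[Hashable], k: int = 10) -> List[List[int]]:
    """Split instance indices into k folds with similar class proportions."""
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds: List[List[int]] = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        for index in indices:
            # Deal instances of each class to the folds in turn.
            folds[position % k].append(index)
            position += 1
    return folds


def train_test_splits(folds: List[List[int]]) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists: each fold is the test set exactly once."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    labels = ["a"] * 50 + ["b"] * 50   # hypothetical two-class dataset
    folds = stratified_folds(labels, k=10)
    for train, test in train_test_splits(folds):
        print(len(train), len(test))
```

Each of the k (train, test) pairs depends only on the dataset and the fold index, which is what makes the k experiments independent of one another.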

In this paper, we focus on parallelizing cross validation by distributing different
folds to execute on different nodes of the cluster. To accomplish this, we modified the
classifiers package. The most important class in this package is Classifier, which
defines the general structure of any classification scheme. All classifier algorithms
are implemented as subclasses of Classifier. It declares two methods,
buildClassifier() and classifyInstance(), which must be implemented by each
classification scheme, so that every scheme defines how it builds a classifier and how
it classifies data instances. The Classifier class thus provides a uniform interface for
building and using all classifiers, so that the same evaluation module can be used to
evaluate the performance and accuracy of any classifier in Weka. The evaluation module is
implemented by the class Evaluation, whose method crossValidateModel() implements the
stratified n-fold cross validation procedure, in which the dataset is divided into n
equal parts. The algorithm is then trained on n-1 parts and tested on the remaining one.
This procedure is repeated for n different training sets, and the estimated error rate is
the average over the test sets. Clearly, these experiments are independent and can be
performed in parallel by up to n computers of the cluster. Although the number of folds
can be chosen by the user, tenfold cross validation (i.e., n=10) is widely regarded as
good practice among data mining practitioners and researchers. If the user provides no
other value for n, Weka uses ten by default.
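
The fold-level parallelism described above can be sketched as follows. This is a minimal
Python illustration, not Weka or Inhambu code: the names stratified_folds, evaluate_fold
and parallel_cross_validation are hypothetical, the "classifier" is a toy majority-class
predictor, and a local thread pool stands in for the remote nodes of the cluster.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor


def stratified_folds(labels, n_folds=10):
    """Assign each instance index to a fold, keeping class proportions similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for indices in by_class.values():
        # Deal the instances of each class out to the folds in turn.
        for idx in indices:
            folds[position % n_folds].append(idx)
            position += 1
    return folds


def evaluate_fold(labels, folds, test_fold):
    """Train on all folds but one; return the error rate on the held-out fold.

    Hypothetical stand-in classifier: predict the majority class of the
    training data. A real scheme would build and apply a model here.
    """
    train = [i for f, fold in enumerate(folds) if f != test_fold for i in fold]
    majority = Counter(labels[i] for i in train).most_common(1)[0][0]
    test = folds[test_fold]
    errors = sum(1 for i in test if labels[i] != majority)
    return errors / len(test)


def parallel_cross_validation(labels, n_folds=10, workers=10):
    """Run the n independent fold evaluations concurrently and average them."""
    folds = stratified_folds(labels, n_folds)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rates = list(pool.map(lambda f: evaluate_fold(labels, folds, f),
                              range(n_folds)))
    return sum(rates) / len(rates)
```

Because each fold evaluation reads only its own training and test indices, the n tasks
share no mutable state, which is the property that makes them safe to run on separate
nodes.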



4 Performance Tests 

In order to evaluate the performance of Inhambu, we performed several simulations
with two classifiers that are popular in the data mining community: PART and
Multilayer Perceptrons. In summary, the PART classifier derives rules from pruned partial
decision trees [5]. Multilayer perceptrons are feedforward neural networks that learn by
means of backpropagation algorithms [10], which are gradient descent techniques with
backward error (gradient) propagation. Our simulations were performed on three datasets
that are benchmarks for data mining methods: Iris Plants, Wisconsin Breast Cancer, and
Congressional Voting Records. These datasets are available at the UCI Machine Learning
Repository [7] and describe classification problems. The Iris Plants dataset consists of
three classes (Setosa, Versicolour and Virginica), each formed by 50 examples of plants.
Each plant is described by four continuous attributes (sepal and petal length and width).
In the Wisconsin dataset, each object has nine ordinal attributes and an associated class
label (benign or malignant). The total number of objects is 699 (458 benign and 241
malignant), of which 16 have a single missing feature. We removed those 16 objects and
used the remaining ones. The Congressional Voting Records dataset includes the votes of
each member of the U.S. House of Representatives on 16 key votes (attributes). There are
435 instances (267 Democrats, 168 Republicans), and each of the 16 attributes is Boolean
valued. However, 203 instances have missing values. These instances were removed, and we
employed the 232 remaining ones in our simulations.

In our simulations, we employed Weka 3.4.1, the most recent version at the time of this
writing. For remote method invocation, Inhambu currently uses the Java/RMI platform,
which allows the invocation of either local or remote methods transparently. We created
a test scenario that implements a remote cross validation service to be executed on a
replicated pool of server hosts. In order to enable the cross validation to execute in
parallel, we implemented a new class named ParallelEvaluation, which inherits the
functionality of the class Evaluation. This new class creates a pool of local threads
that handle the preparation of datasets, the invocation of remote cross validation
services, and the gathering of the results.

Fig. 2. The execution times for the PART algorithm (horizontal axis: number of nodes).

To execute the performance tests, we used a cluster composed of 16 PCs interconnected by
a 100 Mbps Ethernet switch. Each cluster node has a single 433 MHz Celeron processor with
128 KB of cache and 128 MB of memory, running the Linux operating system and JDK 1.4.1.
Initially, we executed the PART algorithm to analyze the datasets mentioned above on 1,
2, 4, 6, 8 and 10 nodes of the cluster. The one-node test refers to a sequential
execution with the original, unmodified Weka software. Each test was carried out ten
times, and the average is shown in Figure 2. As one can see, there was no advantage in
employing parallel execution to run PART on these datasets. Notice that execution with 2
processing nodes is around 8 times slower than local execution. This behavior is typical
for PART, as well as for other lightweight classifier algorithms, when small benchmark
datasets are employed. It is due to the overhead of transmitting one copy of the whole
dataset to ten processing nodes (for tenfold cross validation) via RMI. For each
invocation, threads are created to manage the transmission process, marshalling and
unmarshalling, manipulation of buffers, and so on. In this case the overhead is not
offset by the short execution times (which are less than 0.2 or 0.3 seconds for this
example). However, it is likely that performance gains can be obtained when applying
PART to real-world datasets.
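
To make the trade-off between dispatch overhead and useful computation concrete, the
following back-of-the-envelope model may help. It is only an illustrative sketch under
assumptions of our own, not a measurement from the experiments: each fold pays a fixed
overhead when it is sent to a node, folds run in waves of one fold per node, and all
function names and sample values are hypothetical.

```python
import math


def sequential_time(folds, fold_seconds):
    """Time to run every fold locally, one after another."""
    return folds * fold_seconds


def parallel_time(folds, nodes, fold_seconds, overhead_seconds):
    """Time when folds run in waves of `nodes`, each fold paying a fixed overhead.

    Assumption: the overhead (dataset transfer, marshalling, thread setup)
    is the same for every fold and is not overlapped with computation.
    """
    waves = math.ceil(folds / nodes)
    return waves * (fold_seconds + overhead_seconds)


def speedup(folds, nodes, fold_seconds, overhead_seconds):
    """Ratio of sequential to parallel time; values below 1 mean a slowdown."""
    return (sequential_time(folds, fold_seconds)
            / parallel_time(folds, nodes, fold_seconds, overhead_seconds))
```

In this model, a fold time far below the per-fold overhead yields a speedup below 1 (a
slowdown), while a fold time far above the overhead yields a speedup that approaches the
number of nodes.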

Another set of experiments was carried out in the same environment to analyze the same
datasets, this time using the Multilayer Perceptrons algorithm. The execution times for
these experiments are depicted in Figure 3. In this case, the overall execution time can
be reduced by a factor of 3 or 4 on a cluster with 10 processing nodes. It is important
to emphasize that, in real-world data mining applications, in which huge databases are
common, the performance gains are likely to be even more significant.

In these experiments, all the computers were dedicated to executing our tests. This
ensures that the results are not influenced by other users or applications. In real
situations the computers are not dedicated and may present fluctuating loads; in that
case Inhambu selects only those nodes that are currently idle or in good condition to
execute tasks.

Fig. 3. The execution times for Multilayer Perceptrons.



5 Related Work 

Weka-Parallel [8] is another project that aims at providing a modified, parallel
implementation of Weka. Weka-Parallel leverages parallel processing by distributing
different folds of the cross validation to different computers of a local network. The
similarities between Inhambu and Weka-Parallel are evident, since both focus on
parallelizing cross validation and both operate on local networks of computers. However,
some important differences should be highlighted. Weka-Parallel uses sockets as the
inter-process communication mechanism, whereas Inhambu uses RMI. The latter allows Weka
methods to be invoked either locally or remotely with full location transparency, and
without the need to modify Weka's classes. This can minimize the manual work needed to
adapt new versions of Weka to run on top of Inhambu. In addition, Inhambu provides some
important performance management functionalities, such as the capacity to deal with load
fluctuations as well as with the heterogeneity of the computers in the cluster. In
contrast, Weka-Parallel implements a round-robin scheduling policy, which considers
neither the current utilization of the machines nor their different processing
capacities. Finally, it is worth noting that Inhambu aims at supporting the efficient
execution of a wide range of high performance applications, not restricted to data
mining. In the context of this project, however, both Inhambu and Weka are being
customized to work together.



6 Conclusion and Future Work 

Data mining applications exploit huge amounts of data and are computationally expensive,
demanding large quantities of computational resources such as processor cycles and main
memory. Clusters composed of commodity PCs interconnected by a local network can provide
these resources at low cost for data mining practitioners. However, managing the
heterogeneity and the fluctuating load of the computers in the cluster is necessary for
the effective utilization of clusters of PCs. Inhambu implements mechanisms and policies
that rely on dynamic monitoring to collect information about the availability of
resources, providing the necessary support for the execution of data mining applications
on clusters of PCs and workstations.

In this paper we described a straightforward scheme to execute cross validation in
parallel. Parallelizing cross validation is worthwhile because it is widely used to
evaluate data mining algorithms and it is the most expensive part of this process. In
addition, our parallel cross validation scheme can be applied to all 71 classification
algorithms currently implemented by Weka (which implements 90 algorithms in total).
However, deciding whether or not to use parallel processing to execute data mining
applications may not be trivial. Due to the loose coupling of the architecture, only the
execution of sufficiently expensive tasks can be advantageous. Our experiments showed
that, depending on the size of the dataset, the parallel execution of PART may not be
advantageous, and they suggest that this can be typical behavior for other lightweight
algorithms like it. However, we believe that parallel execution may be advantageous even
for lightweight algorithms, depending on the characteristics of the dataset, such as the
number of instances, the number and type of attributes, and the potential patterns that
can be found by data mining. Our experiments also showed that a more computationally
expensive algorithm such as Multilayer Perceptrons provided performance gains in our
tests, even with small benchmark datasets. Besides, another common situation in data
mining involves evaluating the error rate of several classifiers before choosing the
best one for a particular application and dataset. In this situation, several cross
validations could be performed in parallel, one for each classifier, increasing the
degree of parallelism and the potential gains.

It is worth noting that the scheme proposed here to execute cross validation in parallel
over Inhambu is quite straightforward. No optimization was used. However, in parallel
tenfold cross validation the same dataset is sent to up to ten different nodes. Clearly,
a scheme that groups tasks to execute on the same node of the cluster can reduce the
transfer of datasets over the network. Such a scheme may be particularly advantageous
when several cross validation processes are executed in parallel (e.g. to evaluate
several classifiers in parallel). In the near future, we are going to investigate this
approach, as well as its application to real-world databases, in which huge amounts of
data are processed.



Abstract. The complexity of services and applications provided by Web sites is
ever increasing as traditional Web publishing sites are integrated with new
paradigms such as e-commerce. The execution load of a dynamic Web page is
difficult to estimate, even with information from the application layer. In this
paper the execution latency at the Web server is exploited in order to balance
the loads from dynamic Web pages. Using only the information available to a
Layer-4 Web switch, such as IP address and port number, the proposed algorithm
balances the loads with a new load-update mechanism. The mechanism uses report
packets more efficiently at the same communication cost. Moreover, the proposed
algorithm considers fairness among Web clients, so the clients would experience
a higher quality of service.



1. Introduction 



Web service is the most prevalent Internet service, and its importance and usage grow
year by year. Because large numbers of Web client requests are directed to a popular Web
site at peak times, most such sites organize multiple nodes into one Web server cluster.
In these systems, any client Web request is presented to a front-end server that acts as
a representative for the system. This front end, called the Web switch, retains
transparency of the parallel architecture for the user, guarantees backward
compatibility with Internet protocols and standards, and distributes all Web client
requests to the back-end Web servers. In this paper, Web server cluster refers
collectively to this formation of a Web switch and Web servers, as illustrated in
Fig. 1. The Web switch should distribute incoming requests to the Web servers in a
load-balanced fashion. With only information such as IP address and port number, there
appears to be a limit to how well a request distribution algorithm can balance load.
Moreover, since most Web pages now incorporate executing scripts such as Java, PHP and
so on, load-balancing becomes much more difficult for requests for those dynamic Web
pages. Because the execution latencies of the scripts vary, load-balancing for dynamic
Web pages needs to consider fairness among Web clients. In this paper we propose a
request distribution algorithm with a new load-update mechanism. Although many
algorithms have been proposed, to the best of our knowledge no previous work has
suggested this load-update mechanism. In the next section, the kinds of information
available for load-balancing are introduced and fairness among Web clients is described.
The load-update mechanism is presented while developing a request distribution algorithm
in Section 3. The proposed algorithm is compared with related previous work in terms of
the load-update mechanism in Section 4, and the simulation results are presented in
Section 5. The last section concludes with the effect of the new load-update mechanism.




[Figure labels: Web Client, Web Switch, Web Server; latency at Web server]

Fig. 1. Web Server Cluster with Isolated Network

Fig. 2. Web Service Protocol of the Cluster System



2. Web Switch and Request Distribution for Fairness



For good performance of the Web server cluster, the Web switch should distribute
requests so that the loads of the Web servers in the cluster are balanced. According to
the OSI protocol stack layer at which the Web switch operates, Web switches are broadly
classified into Layer-4 and Layer-7 Web switches [1]. The Layer-4 Web switch has only
the information related to the TCP/IP layers, such as IP address and port number. The
Layer-7 Web switch parses the request and obtains information up to the application
layer, such as URL contents, SSL identifiers and cookies. The Layer-4 Web switch is not
aware of content information, whereas the Layer-7 Web switch is. Since more information
is available to it, the Layer-7 switch is capable of more accurate decisions in
distributing the requests. However, the Layer-7 Web switch introduces processing
overhead so severe that the Web switch may limit the scalability of the Web server
cluster. In [5], the peak throughput achieved by a Layer-7 Web switch is limited to
3,500 connections per second, while a software-based Layer-4 Web switch implemented on
the same hardware is able to sustain a throughput of up to 20,000 connections per
second. For this reason we propose a distribution algorithm and a system organization
founded on the Layer-4 Web switch.

Much work on load balancing in distributing requests has been conducted for the Web
server cluster [2, 3, 4]. This work mainly considers static Web pages rather than
dynamic Web pages with executing scripts. Nowadays there are many dynamic Web pages
using Java, PHP, ASP and so on. The load of such Web pages is difficult to estimate,
even with information from the application layer. Even if such an estimation algorithm
existed, the Web switch could not use highly sophisticated algorithms in distributing
requests, since it has to make an immediate decision for hundreds or thousands of
requests per second. For this reason we need a simple algorithm.

The typical service protocol of the Web server cluster is illustrated in Fig. 2. When
the Web switch receives a Web client request, it determines whether the request belongs
to a currently connected Web client or asks for a new connection to a Web server. If a
hash function at the Web switch finds that the request belongs to a current connection,
the request is relayed to the connected Web server. Otherwise, the distribution
algorithm selects a Web server for the new connection. After the request is processed at
the Web server, the response is routed to the Web client. As depicted in Fig. 2, the
response time is composed of four parts:



$t_{Response} = t_{Outside} + t_{Distribution} + t_{Inside} + t_{Processing}$    (1)

$t_{Response}$ is the time elapsed from when the Web client sends its request until it
starts receiving the Web server's response. The client request takes one half of
$t_{Outside}$ to reach the Web switch, and the server response takes the other half of
$t_{Outside}$ to reach the client. This is the time elapsed outside the system only.
Having received the request, the Web switch chooses the Web server by the distribution
algorithm or a hash function, which takes $t_{Distribution}$. The request is then
relayed to the Web server in one half of $t_{Inside}$ and processed at the Web server
for $t_{Processing}$. After processing, the response takes the other half of
$t_{Inside}$ to leave the system, in other words, to reach the bridge/router in Fig. 2.
Among those parts, $t_{Outside}$ depends on the location of the Web client. Thus it
varies and cannot be reduced by the system. Reducing $t_{Inside}$ is not relevant to the
distribution algorithm and is thus out of the scope of this paper.

To reduce $t_{Distribution}$, the latency at the Web switch, a fast and simple
distribution algorithm is necessary. Balancing the loads of the Web servers would reduce
$t_{Processing}$, the latency at the Web server, although it also depends directly on
the Web server's capacity. The latency at the Web switch is common to every Web client.
Thus equalizing the average latencies at the Web servers should be fair for the Web
clients with respect to the response time. Because $t_{Outside}$ varies with each Web
client's distance from the system, the response times are not equal in reality; however,
we regard this as still fair for all the clients. Therefore we use the execution
latencies to compare each server's load when evaluating the load-balancing. Exploiting
the response time for load balancing is nothing new [6]; however, the latency at the Web
server is not exactly the same as the response time. Moreover, the distribution
algorithm we propose has a particular load-update mechanism, which makes it work quite
differently from other algorithms proposed so far.



3. Load-Balancing with a New Load-Update Mechanism



Conventionally the Web switch of the cluster gathers information on each server's load
periodically. All the servers in the cluster report their load information to the Web
switch at a given rate, and load-balancing algorithms have used this periodic
load-update mechanism. This mechanism is good for updating the actual loads of all the
servers simultaneously, since the reporting is synchronized across all the servers.
However, it is not good for system scalability, since the number of report packets
concentrated at the Web switch grows as the number of servers increases. In this section
we introduce a new load-update mechanism in which the reporting is not synchronized
among the servers.

The objective of a load-balancing algorithm is to keep the loads even among the servers.
Previous studies have suggested that the run-queue length best describes a server's
load, and many load-balancing algorithms have adopted this metric [9]. Since we focus on
execution latencies, we adopt this metric as well. Our basic idea is to first send equal
numbers of requests to each server and then let a server that has fewer requests report
its 'lessness'.



Load-balancing Algorithm 1. High-Communication-Cost Model

For every request packet arriving at the Web switch:

1. The Web switch merely distributes the incoming requests to the
servers in traditional 'Round-Robin'. Equal numbers of requests
are executing in the servers.

2. When a request finishes its execution, the server that
processed the request immediately reports that one request has
finished.

3. The Web switch subtracts one from the load value of the
reporting server in the Load Table.

4. IF the load values are not all equal, the Web switch finds the
lowest value and sends one request to that server,

ELSE the Web switch sends the request in 'Round-Robin' order.

5. Whenever the Web switch sends one request, it adds one to the
load value of the target server in the Load Table. Continue at
Line 2.
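
The steps of Algorithm 1 can be sketched in Python as below. This is a minimal
illustration rather than the authors' implementation: the class and method names are
hypothetical, and breaking ties among equally low load values by taking the lowest
server index is an assumption, since the algorithm does not specify it.

```python
class WebSwitch:
    """Layer-4 switch keeping a Load Table of executing requests per server."""

    def __init__(self, n_servers):
        self.load = [0] * n_servers  # Load Table: one load value per server
        self.next_rr = 0             # position of the Round-Robin pointer

    def dispatch(self):
        """Choose the target server for one incoming request (steps 1, 4, 5)."""
        lowest = min(self.load)
        if any(value != lowest for value in self.load):
            # Step 4, IF branch: send to the server with the lowest load.
            # Assumption: ties go to the lowest server index.
            target = self.load.index(lowest)
        else:
            # Step 4, ELSE branch: all loads equal, so use Round-Robin order.
            target = self.next_rr
            self.next_rr = (self.next_rr + 1) % len(self.load)
        self.load[target] += 1       # step 5: add one for the request sent
        return target

    def on_finish_report(self, server):
        """Handle a report that one request finished (steps 2 and 3)."""
        self.load[server] -= 1
```

The switch does constant bookkeeping per event apart from the scan for the lowest load
value, which is linear in the number of servers.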

Line 4 of the algorithm guarantees that the Web switch keeps the numbers of executing
requests equal among the servers. This algorithm is quite simple and works nicely. A
request finishes early under two conditions: the execution length of the request was
itself shorter, or the request shared resources, e.g. the CPU, with fewer other
requests. While most of the other requests are in their I/O phase, the requests in their
CPU phase get more CPU time. Each server is processing an equal number of requests at
any instant; however, the throughput of each server is different. Since the Web switch
does not have enough time to reschedule incoming requests so as to overlap one request's
CPU phase efficiently with another request's I/O phase, this algorithm should show ideal
load-balancing. The algorithm is compared with previous work in Section 4.

For the algorithm to work, the finish of each request must be reported immediately to
the Web switch. We named this load-information update mechanism Update-on-Finish. This
update is neither periodic nor synchronized among the servers. If a server receives n
requests from the Web switch, it must send exactly n reports. Although the 'isolated
network' of Fig. 1 could accommodate the communication for reporting, the communication
cost to the Web switch for this number of updates may be high. We therefore extend the
algorithm for less costly communication.

Algorithm 1 ensures that the numbers of executing requests stay equal among the servers.
At each server, while some requests finish their execution, new requests keep arriving
by the Round-Robin of Algorithm 1. Thus reporting is only necessary when the number of
executing requests decreases. The Web switch distributes requests in Round-Robin, or
sends more requests to a server when its report arrives.



Load-balancing Algorithm 2. Lower-Communication-Cost Model

For every request packet arriving at the Web switch:

1. The Web switch merely distributes the incoming requests to the
servers in traditional 'Round-Robin'. Equal numbers of requests
are executing in the servers.

2. When the number of executing requests decreases, the server
reports that n more requests are needed.

3. The Web switch subtracts n from the load value of the
reporting server in the Load Table.

4. IF the load values are not all equal, the Web switch finds the
lowest value and sends one request to that server,

ELSE the Web switch sends the request in 'Round-Robin' order.

5. Whenever the Web switch sends one request, it adds one to the
load value of the target server in the Load Table. Continue at
Line 2.
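
The server side of Algorithm 2 might look like the sketch below. It rests on our own
reading of the text and is not the authors' code: the server checks its count of
executing requests at a fixed interval, sends a report only when that count is lower
than at the previous check, and puts into the report the number n of requests finished
since its last report. All names are hypothetical.

```python
class Server:
    """Server side of Update-on-Finish with batched reports."""

    def __init__(self):
        self.executing = 0            # requests currently executing
        self.count_at_last_check = 0  # executing count seen at the previous check
        self.unreported_finishes = 0  # finishes not yet reported to the switch

    def accept(self):
        """A request arrives from the Web switch."""
        self.executing += 1

    def finish(self):
        """A request finishes its execution."""
        self.executing -= 1
        self.unreported_finishes += 1

    def check(self):
        """Run once per check interval; return n to report, or 0 for no report."""
        decreased = self.executing < self.count_at_last_check
        self.count_at_last_check = self.executing
        if not decreased:
            return 0                  # count did not drop: stay silent
        n = self.unreported_finishes
        self.unreported_finishes = 0
        return n


def apply_report(load_table, server_id, n):
    """Switch side (step 3): subtract n from the reporting server's load value."""
    load_table[server_id] -= n
```

Under this reading, finishes that are masked by new arrivals stay pending and are
included in the next report, so the Load Table catches up with the true count once a
report is sent.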

In Algorithm 1, each server receives one more request immediately after it reports that
it has one request fewer than the other servers. However, since the Web switch otherwise
distributes requests in Round-Robin, the number of executing requests soon recovers
after one request has finished. Recall that the execution lengths of requests differ
from one another. Receiving equal numbers of requests does not result in equal numbers
of executing requests.

Let $\Phi$ be the period of Round-Robin. Assume one request has finished execution at a
server, and the server should receive one request within $\Phi/2$ by Round-Robin. Let
the server not report if no more requests finish within $\Phi/2$. The server counts the
number of executing requests every $\Phi/2$. Thus the server reports after one or more
requests have finished within the first half of $\Phi$ and two or more requests have
finished within the second half of $\Phi$. If the server counts the number of executing
requests every $\Phi$, Line 2 of Algorithm 2 reduces the communication cost to $1/m$
when a mean of $m$ execution finishes is reported in each packet.

M.H. Ok and M.-s. Park



4. Related Works and Comparison



Many experiments and simulation results have demonstrated that Weighted Round-Robin
(WRR) offers the best combination of simplicity and efficacy [2]. The most recent work
that exploits a load-update mechanism is Dahlin's algorithm [7]. WRR uses periodic
load-update. Once the Web switch learns each server's load, it sends requests to a less
loaded server at a higher rate and to a more loaded server at a lower rate, so that they
reach equal loads before the next load-update. Dahlin's algorithm also uses periodic
load-update. The Web switch learns the differences in load between servers from the
load-update. It sends requests to the servers with the least loads. After the loads of
all other servers have been brought up to that of the most loaded server, the Web switch
distributes requests in Round-Robin manner until the next load-update. We now compare
the proposed algorithms to these two algorithms with respect to their load-update
mechanisms.

For any algorithm, a higher rate of load-update achieves a more balanced load
distribution among the servers. We define the reporting cost, $R$, as follows:

$R$: the number of packets received by the Web switch in a given period

Thus the reporting cost of periodic update is $R_p = n \cdot p$, where $n$ is the number
of servers and reporting takes place $p$ times in the period. In the periodic update
mechanism, exactly $n$ report packets must be used at every reporting time; with
Update-on-Finish (in Algorithm 2) at the same reporting cost, $R_u = R_p$, a server uses
a report packet only when its number of executing requests decreases. Whereas each
server uses exactly $p$ packets during a given period in the periodic update mechanism,
the Update-on-Finish mechanism (in Algorithm 2) allows, within the same $n \cdot p$
packets, more reporting by the servers that finish requests more frequently and less
reporting by the servers that finish requests less frequently. Therefore the
Update-on-Finish mechanism uses the report packets more efficiently at the same
communication cost. In the next section we compare the proposed algorithm with the two
algorithms.



5. Performance Evaluation



A simulator of the Web server cluster was implemented. A Web switch and 5 Web servers
constitute the cluster. 4000 requests are processed in a simulation of 10 seconds. The
execution lengths of the requests range from 5 to 50, and each server processes 0.2 (in
length) of a request per millisecond. The execution lengths are generated from a Pareto
distribution. Pareto distributions have been found to correspond to some real-world
workloads, such as the execution lengths of Web requests [2]. We model the execution
lengths as being independent and identically distributed according to a distribution
that follows a power law but has an upper bound. It is characterized by three
parameters: $\alpha$, the exponent of the power law; $k$, the smallest possible
observation; and $p$, the largest possible observation. The probability density function
of this bounded Pareto distribution is defined as:



$$f(x) = \frac{\alpha k^{\alpha}}{1 - (k/p)^{\alpha}} \, x^{-\alpha-1}, \qquad k \le x \le p. \qquad (2)$$
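
Execution lengths with the density of Eq. (2) can be drawn by inverting its cumulative
distribution function, $F(x) = (1 - (k/x)^{\alpha}) / (1 - (k/p)^{\alpha})$. The Python
sketch below is an illustration under assumptions, not the authors' simulator: the
function names are hypothetical, and the value of $\alpha$ used in any call is the
caller's choice, since the source only says that estimates tend to be close to 1.

```python
import random


def bounded_pareto_pdf(x, alpha, k, p):
    """Density of the bounded Pareto distribution of Eq. (2) on [k, p]."""
    if x < k or x > p:
        return 0.0
    return alpha * k ** alpha * x ** (-alpha - 1) / (1.0 - (k / p) ** alpha)


def bounded_pareto_sample(alpha, k, p, rng=random):
    """Draw one value by inverse-transform sampling.

    Solving u = F(x) for x gives
    x = k / (1 - u * (1 - (k/p)**alpha)) ** (1/alpha),
    which maps u = 0 to k and u -> 1 to p.
    """
    u = rng.random()
    return k / (1.0 - u * (1.0 - (k / p) ** alpha)) ** (1.0 / alpha)
```

For example, `bounded_pareto_sample(1.0, 5, 50)` yields a value between the smallest and
largest execution lengths stated for the simulation, with $\alpha = 1$ as an assumed
exponent.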



In most cases where estimates of $\alpha$ were made, $\alpha$ tends to be close to 1,
which represents very high variability in service requirements. It is known that the
Poisson distribution is far from realistic for request arrivals over the Internet. In
our simulation, the request arrival process follows a uniform distribution.

Weighted Round-Robin, Dahlin's algorithm, and Load-Balancing Algorithm 2 were simulated
with the same reporting costs. The reporting costs are averaged over one hundred
generations of 4000 request packets. Algorithm 2 was simulated first with the one
hundred generations to count the number of report packets sent from the servers. We then
found the update periods equivalent to the average numbers of report packets, as listed
in Table 1.



Table 1. Corresponding update periods with equal reporting costs

Number of report packets    1199       1540         2086
Update period               42 msec    32.5 msec    24 msec
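
As a rough cross-check of Table 1, the periodic reporting cost defined in Section 4 (n
servers each reporting p times) can be evaluated for the simulated setting of 5 servers
and 10 seconds. The Python sketch below is our own arithmetic, with a hypothetical
function name; it assumes that p is simply the simulation duration divided by the update
period.

```python
def periodic_reporting_cost(n_servers, duration_ms, period_ms):
    """R_p = n * p, with p = duration / period reports per server."""
    reports_per_server = duration_ms / period_ms
    return n_servers * reports_per_server


for period in (42, 32.5, 24):
    # 5 servers and a 10-second (10,000 ms) simulation, as stated in Section 5.
    print(period, round(periodic_reporting_cost(5, 10_000, period), 1))
# prints:
# 42 1190.5
# 32.5 1538.5
# 24 2083.3
```

These values lie within about one percent of the 1199, 1540 and 2086 report packets
listed in Table 1. The remaining differences presumably come from how the equivalent
periods were rounded, which the text does not specify.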



WRR shows a mean execution latency over the 5 servers of 3.82 with 1199 report packets
(Fig. 3). Dahlin's algorithm shows 2.91 with 1199 report packets (Fig. 4). Algorithm 2,
the proposed load-balancing algorithm, shows 2.86 with 1199 report packets (Fig. 5).
Algorithm 2 has the lowest standard deviations among the three algorithms (sum of
standard deviations: Dahlin's, 6.30893; Algorithm 2, 5.69997). WRR and Dahlin's
algorithm were run using periodic update with a period of 42 milliseconds (equivalent to
1199 report packets). At equal reporting cost, Algorithm 2 outperformed the other two,
and the gap widens as more report packets are used. Figure 6 illustrates the effect of
the update period.





Fig. 3. Execution Latencies of Requests in WRR

Fig. 4. Execution Latencies of Requests in Dahlin's

Fig. 5. Execution Latencies of Requests in Algorithm 2 (horizontal axis: time in
milliseconds)

Fig. 6. Mean Execution Latency of 5 Servers, which the Web Clients Experience
(horizontal axis: update period)



The values in the figure are averaged over one hundred simulations. The mean execution
latency of the 5 servers, labeled system execution latency in the figure, decreases
slightly more for Algorithm 2 than for Dahlin's algorithm as the update period
decreases. In WRR, the Web switch sends each server a number of packets inversely
proportional to its load during a period, in order to balance the loads of the servers
by the next load-update. Dahlin's algorithm balances the server loads as soon as
possible with the current load information, and the subsequent packets are distributed
by the Web switch in Round-Robin until the next load-update. Since Dahlin's algorithm
achieves load balance much earlier than WRR, the gap between the two algorithms is
large. Algorithm 2 balances the loads whenever a report packet arrives, keeping track in
the Load Table of the request packets sent for load-balancing. Each server sends a
report packet when it needs more request packets during a period, whereas Dahlin's
algorithm uses report packets only at the mandatory update times. Thus load-balancing
actions happen more often than in Dahlin's algorithm, and this difference results in the
gap between Dahlin's algorithm and Algorithm 2.



6. Conclusion

The load of a request for a dynamic Web page is difficult to estimate, since executing
scripts have varying execution lengths, and the variation grows if the Web page takes
input parameters for the scripts. Using only the information available to a Layer-4 Web
switch, such as IP address and port number, the proposed algorithm balances the loads by
sending packets as needed. A report packet contains the number of request packets that
have finished their execution. Thus the load-update is naturally non-periodic. The
system execution latency of the proposed algorithm was 98.32 percent, 97.57 percent, and
96.15 percent of that of Dahlin's algorithm, with reporting costs equivalent to the
update periods of 42, 32.5 and 24 milliseconds, respectively. Moreover, the proposed
algorithm considers fairness among Web clients, so the clients would experience a higher
quality of service. Another advantage of the algorithm is its simplicity, since a
simpler distribution algorithm leads to higher throughput of the Web switch.
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Although not resented in this paper the non-periodic load-update occurs asynchro- 
nously among servers. This reduces the communication workload for the Web switch 
than that of periodic update since all the report packets are not concentrated at any in- 
stant. Thus the new load-update mechanism would support higher scalability. 
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Abstract. As enterprises world- wide racing to embrace real-time man- 
agement to improve productivities, customer services, and flexibility, 
many resources have been invested in enterprise systems (ESs).All mod- 
ern ESs adopt an n-tier client-server architecture that includes several 
application servers to hold users and applications. As in any other multi- 
server environments, the load distributions, and user distributions in 
particular, becomes a critical issue in tuning system performances. 

Although an n-tier architecture may involve web servers, no literature on
Distributed Web Server Architectures has considered the effects of
distributing users, instead of individual requests, to servers. The algorithm
proposed in this paper returns specific suggestions, including explicit user
distributions, the number of servers needed, and the similarity of user
requests in each server. The paper also discusses how to apply the knowledge
of past patterns to allocate new users, who have no request patterns, in a
hybrid dispatching program.

1 Introduction 

All modern ESs share a common IT foundation, namely, the n-tier client-server
architecture. The architecture has a database server in the storage layer,
multiple application servers in the service layer, several web servers in the
interface layer, and browsers or other access devices in the presentation
layer. In this architecture, programs, applications, or transactions are held
and executed in the application servers. When a user logs on to a system,
he/she either selects an application server or is assigned to one by the
system.

It is a vital issue for most system administrators to keep response time under
control. When increases in memory and CPUs reach the hardware limit, adding
more application servers to a system is a reasonable alternative. When an ES
has multiple application servers, distributing users with similar applications
to the same application servers plays an important role in tuning system
performance¹, as pointed out by the documentation of a major ES [1, 10].

* This study is supported by the MOE Program for Promoting Academic Excellence

Commercial products, such as SAP R/3, are equipped with a simple dispatching
algorithm that considers only user numbers and server response time [1]. The
task of grouping users is left to system administrators [1, 10]. Beyond the
rough guideline of grouping financial users into one server and logistic users
into another, system administrators need specific suggestions, such as explicit
user distributions, the number of servers needed, and the similarity of user
requests in each server. To address these needs, the paper presents a set of
algorithms to collect transaction patterns, establish pattern prediction rules,
associate patterns with users, group users into clusters by pattern, select
clusters to form distributions, and dispatch users with a hybrid methodology
that handles users with or without patterns. Patterns can be collected from
system logs or traces because the transactions run by each user in daily
operations are specified and set in the implementation phase and are seldom
changed after the system goes live.

2 Finding Users’ Regular Transactions

To record system and user statuses, most enterprise systems include various
tracing mechanisms. Among the recordable data are user sessions and the
applications executed in those sessions. For the purpose of the paper, these
data are transformed into user profiles. A user profile is a set
{(user-id, transaction-set)}, where user-id is the account name of a user and
transaction-set is the set of transactions accessed by the user in a session.

To compute or estimate regular transactions for each user, three steps are
employed. The first step computes large itemsets with any existing
set-oriented pattern discovery algorithm, such as [3, 11]. In the second step,
each large 1-itemset is examined against each user to form the users’ regular
transactions. For new users who have not accumulated enough entries to compute
personal regular transactions, the paper proposes to predict their regular
transactions with association rules computed by the known Apriori algorithms.
Assume the regular transactions are as shown in Table 1.

New users do not have any records in the user profiles and therefore have no
associated regular transactions. However, dispatching programs still need to
dispatch them at run time, so the dispatching programs need help in guessing
the patterns of new users.

If each new user provides one of the transactions she/he wishes to access
after logging on, the dispatching program can check whether the transaction
has a high association with any large itemsets. If so, the union of those
large itemsets is taken as the user’s Predicted Regular Transaction set. This
is the third step.



¹ An application in an ES corresponds to an atomic and unbreakable transaction. In
this paper, transactions and applications are used interchangeably.

Table 1. Regular Transactions

User-Id   Regular Transactions
1         {A, B, E, F, H}
2         {A, B, E, F, H}
3         {A, B, E, F, H}
4         {I, J, K}
5         {B, I, J, K}
6         {B, I, J, K}
7         {P, Q, R}
8         {P, Q, R}
9         {P, Q, R}
10        ∅



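Definition 1 The Associated Regular Transactions of a transaction, t, under a
set of large itemsets, P, and a user profile, U, is

\[
AT(t) = \bigcup_{p \in P \,\wedge\, t \in p \,\wedge\, CP_U(p \mid t) > \text{confidence threshold}} p ,
\]

where

\[
CP_U(p \mid t) = \frac{|\{ s \mid s \in U,\ p \subseteq s.\text{transaction-set} \}|}
                      {|\{ s \mid s \in U,\ t \in s.\text{transaction-set} \}|} .
\]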
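As a rough illustration of Definition 1, the following Python sketch computes AT(t) for a small user profile. The profile, the large itemsets, the threshold values, and the function names are all made up for illustration; in practice the large itemsets would come from a pattern discovery algorithm such as Apriori.

```python
def conditional_probability(itemset, t, profile):
    """CP_U(p|t): sessions containing all of p, over sessions containing t."""
    sessions_with_t = sum(1 for _, transactions in profile if t in transactions)
    if sessions_with_t == 0:
        return 0.0
    sessions_with_p = sum(1 for _, transactions in profile if itemset <= transactions)
    return sessions_with_p / sessions_with_t


def associated_regular_transactions(t, large_itemsets, profile, threshold):
    """AT(t): union of the large itemsets that contain t and pass the threshold."""
    result = set()
    for itemset in large_itemsets:
        if t in itemset and conditional_probability(itemset, t, profile) > threshold:
            result |= itemset
    return result


# Hypothetical user profile: (user-id, transaction-set) pairs, one per session.
PROFILE = [
    ("u1", {"A", "B", "E"}),
    ("u2", {"A", "B", "E"}),
    ("u3", {"A", "B"}),
    ("u4", {"A", "F"}),
    ("u5", {"I", "J"}),
]
# Hypothetical large itemsets, as a pattern discovery step might return them.
LARGE_ITEMSETS = [frozenset("AB"), frozenset("ABE"), frozenset("AF"), frozenset("IJ")]

# With a confidence threshold of 0.4, transaction A is associated with A, B and E.
print(associated_regular_transactions("A", LARGE_ITEMSETS, PROFILE, 0.4))
```

Raising the threshold shrinks the predicted set: in this toy profile, a threshold of 0.6 leaves only {A, B}, because {A, B, E} appears in just half of the sessions that contain A.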



3 The Definitions of Similarity Measure, Clusters, and 
Distributions 

Load balancing programs exploit the benefit of multiple servers at the cost of
wasting memory on duplicated programs and data. In sophisticated application
servers with hundreds of people on-line, the memory needed for transactions is
considerable [2]. Therefore, users with similar regular transactions should be
grouped into one cluster, which is then assigned to a server. This section
defines the measure of similarity and proves related properties. Formal
definitions of clusters and distributions are also included.

Definition 2 A Cluster is a set of users that share similar regular transactions. 

The similarity of users in a cluster is measured by AR, Application
Reusability. The AR of a transaction in a cluster is defined as the percentage
of users in the cluster invoking the transaction, and the AR of a cluster is
defined as the average AR of the transactions in the cluster.

Definition 3

- The regular transactions of a user, u, are denoted T(u).

- The AR of a transaction, t, in a cluster, c, is defined as

\[
R(t, c) = \frac{|\{ u \mid u \in c \text{ and } t \in T(u) \}|}{|c|} .
\]

- The AR of a cluster, c, is defined as the average AR of the regular
  transactions in the cluster:

\[
R(c) = \frac{\sum_{t \in T(u) \text{ and } u \in c} R(t, c)}
            {|\{ t \mid u \in c \text{ and } t \in T(u) \}|} .
\]



Theorem 1 (Conditional Anti-Monotonicity of AR) R(c) decreases when a new
user is added to c, if the number of transactions accessed by the new user is
fewer than or equal to the average number of transactions accessed by the
original users.

Proof Omitted to save space.

Therefore, R(c) has the property of Conditional Anti-Monotonicity, which
allows POCA to prune unpromising user groups.

Definition 4

- A cluster whose AR exceeds a given AR threshold is called a qualified cluster.

- A set of clusters is comprehensive under a user profile, U, if the union of
  the clusters includes all users with regular transactions and no others.

- A set of clusters is disjoined if the intersection of any two clusters in the
  set is empty.

- A set of qualified clusters is a distribution under a user profile, U, if
  the set is comprehensive under U and disjoined.

In the running example, the clusters {1, 2, 3}, {4, 5, 6}, and {7, 8, 9} have
ARs of 100%, 11/12, and 100%, respectively. If the AR threshold is set at 55%,
then all three are qualified clusters. The three clusters are both
comprehensive and disjoined in the example, and therefore form a valid
distribution.
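The AR values quoted for the running example can be checked with a short Python sketch. It assumes the reading of Definition 3 in which the AR of a transaction is the fraction of users in the cluster whose regular transactions include it; the function names are illustrative, and exact fractions are used so the results can be compared with the text.

```python
from fractions import Fraction

# Regular transactions from Table 1 (user 10 has none yet and is left out).
REGULAR = {
    1: set("ABEFH"), 2: set("ABEFH"), 3: set("ABEFH"),
    4: set("IJK"), 5: set("BIJK"), 6: set("BIJK"),
    7: set("PQR"), 8: set("PQR"), 9: set("PQR"),
}


def transaction_ar(t, cluster, regular=REGULAR):
    """R(t, c): fraction of users in the cluster whose regular transactions include t."""
    return Fraction(sum(1 for u in cluster if t in regular[u]), len(cluster))


def cluster_ar(cluster, regular=REGULAR):
    """R(c): average of R(t, c) over all regular transactions used in the cluster."""
    transactions = set().union(*(regular[u] for u in cluster))
    return sum(transaction_ar(t, cluster, regular) for t in transactions) / len(transactions)


for cluster in ({1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {1, 2, 3, 7}):
    print(sorted(cluster), cluster_ar(cluster))
```

The last cluster evaluates to 9/16, which is the 18/32 that appears for the four-user clusters later in the example.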

4 Clustering and Distributing by POCA 

POCA returns the distributions that satisfy the administrator's constraints
with the fewest clusters, together with the rules associating single
transactions with predicted regular transactions. The constraints include an
AR threshold, a min-support, and a rule confidence threshold. The
recommendations guarantee that when all frequent users log on to the system
and access all their regular transactions, each server still has an AR above
the given AR threshold. The recommendations include the number of servers, the
user distribution, and the association rules for predicting patterns from
transactions.

POCA relies on the order >_U, which is a chain, to preserve Conditional
Anti-Monotonicity and to form each user combination at most once.

Definition 5 Let S be the set of users in a user profile, U. The order >_U is
defined on S such that for any u_1, u_2 ∈ S, u_1 >_U u_2 if

- |T(u_1)| > |T(u_2)|, or

- |T(u_1)| = |T(u_2)| and the user-id of u_1 < the user-id of u_2.
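The order of Definition 5 can be sketched as a sort key in Python. The sketch below assumes that users are compared by the number of their regular transactions, with ties broken by the smaller user-id; `rank_key`, `CHAIN`, and `candidates` are illustrative names, and the data are the regular transactions of Table 1.

```python
# Regular transactions from Table 1 (users with regular transactions only).
REGULAR = {
    1: set("ABEFH"), 2: set("ABEFH"), 3: set("ABEFH"),
    4: set("IJK"), 5: set("BIJK"), 6: set("BIJK"),
    7: set("PQR"), 8: set("PQR"), 9: set("PQR"),
}


def rank_key(user_id, regular=REGULAR):
    """Sort key for the chain: more transactions first, then smaller user-id."""
    return (-len(regular[user_id]), user_id)


# Users listed from highest to lowest rank.
CHAIN = sorted(REGULAR, key=rank_key)


def candidates(cluster, chain=CHAIN):
    """Users ranked below every member of the cluster.

    Extending a cluster only with such users forms each combination at most once.
    """
    lowest = max(chain.index(u) for u in cluster)
    return chain[lowest + 1:]


print(CHAIN)
print(candidates({5, 6}))
```

Under these assumptions users 5 and 6 rank above user 4, so the cluster {5, 6} may be extended by user 4 but never the other way round.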

POCA computes the recommendations in two major steps: computing the set of
qualified clusters, and selecting clusters to form server distributions. The
main steps are listed as follows:

Initialization: Turn each user with regular transactions into a single-user
cluster. These clusters form C_1, the 1-user cluster set.

Composing C_{i+1} from C_i: Conditionally join C_i with C_1 to form C_{i+1}.
A cluster c_i from C_i is extended by the user in a cluster c_1 from C_1 if
two criteria are met. The first states that the user from c_1 has a lower rank
in >_U than any user in c_i. The second asserts that the new cluster has an AR
value exceeding the given threshold.

Repeating the Last Step Until No New Clusters Are Generated: If C_{i+1} is
empty, then POCA has found all qualified clusters in C_1, ..., C_i; otherwise,
POCA repeats the last step.

Selecting Clusters to Form Distributions: Find the fewest qualified clusters
that form a distribution. The algorithm loops over i, checking whether i
clusters can form a distribution, where 1 ≤ i ≤ |C_1|. The loop stops as soon
as distributions are found.

In the running example, the first cluster set is formed by turning each user
into a cluster. With the AR threshold set at 55%, C_2, the set of 2-user
clusters, is the conditional join of C_1 and C_1: C_2 = {{1, 2}, {1, 3},
{2, 3}, {5, 6}, {5, 4}, {6, 4}, {7, 8}, {7, 9}, {8, 9}}. C_3 is the conditional
join of C_2 and C_1: C_3 = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}. C_4 is the
conditional join of C_3 and C_1: C_4 = {{1, 2, 3, 7}, {1, 2, 3, 8},
{1, 2, 3, 9}}. C_5 is empty since the potential 5-user clusters have ARs lower
than 55%.

Now that all cluster sets are ready, it is time to select clusters to form
user distributions. The selection proceeds by examining 1-cluster
distributions, 2-cluster distributions, and so on. In the running example,
there are no 1-cluster or 2-cluster distributions since there are no 9-user or
5-user clusters. On the other hand, there are several alternative 3-cluster
distributions, all of which are returned to system administrators. The
3-cluster distributions and their ARs are listed in Table 2. For each
candidate, POCA only needs to check the comprehensiveness of its clusters. The
algorithm returns all the distributions that satisfy the requirements and lets
system administrators decide which distribution they prefer.
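The selection step can be sketched in Python as a search over combinations of qualified clusters, trying 1 cluster, then 2, and so on. This is a brute-force illustration rather than POCA's actual implementation; the function name is an assumption, and the cluster sets are taken verbatim from the running example.

```python
from itertools import combinations


def fewest_cluster_distributions(qualified, users):
    """Return every distribution that uses the smallest possible number of clusters.

    A combination is a distribution when its clusters are pairwise disjoint and
    together cover all users; equal total size plus full coverage implies both.
    """
    users = frozenset(users)
    for size in range(1, len(users) + 1):
        found = [
            combo for combo in combinations(qualified, size)
            if sum(len(c) for c in combo) == len(users)
            and frozenset().union(*combo) == users
        ]
        if found:
            return found
    return []


# Qualified cluster sets as listed in the running example.
USERS = range(1, 10)
C1 = [frozenset({u}) for u in USERS]
C2 = [frozenset(p) for p in ({1, 2}, {1, 3}, {2, 3}, {5, 6}, {5, 4},
                             {6, 4}, {7, 8}, {7, 9}, {8, 9})]
C3 = [frozenset(c) for c in ({1, 2, 3}, {4, 5, 6}, {7, 8, 9})]
C4 = [frozenset(c) for c in ({1, 2, 3, 7}, {1, 2, 3, 8}, {1, 2, 3, 9})]

for distribution in fewest_cluster_distributions(C1 + C2 + C3 + C4, USERS):
    print([sorted(c) for c in distribution])
```

With these inputs no single cluster or pair of clusters covers all nine users, and the search stops at four 3-cluster distributions.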



Table 2. 3-Cluster Distributions and ARs

User Distributions                 ARs
{1, 2, 3}, {4, 5, 6}, {7, 8, 9}    100%, 11/12, 100%
{1, 2, 3, 7}, {4, 5, 6}, {8, 9}    18/32, 11/12, 100%
{1, 2, 3, 8}, {4, 5, 6}, {7, 9}    18/32, 11/12, 100%
{1, 2, 3, 9}, {4, 5, 6}, {7, 8}    18/32, 11/12, 100%

5 An AR Based Hybrid Dispatching Approach

Each ES typically has a dispatching program that listens to the network and
accepts user requests. The program resides on an application server,
intercepts user requests, and directs them to application servers.

The distributions suggested by POCA are based on frequent patterns in user
profiles. For new and infrequent users, POCA does not suggest distributions
directly but returns association rules, PR (Prediction Rules), in its output
to help the dispatching program make the decision. To apply the rules, a new
user only needs to provide a transaction he/she plans to invoke after logging
on to the ES. With the association rules, a dispatching program can distribute
a user according to the associated predicted regular transactions. If the
first transaction does not lead to any predicted regular transactions, then
the single transaction serves as the basis for dispatching.

An AR Based Hybrid dispatching algorithm distributes users while keeping the
AR of each server as high as possible. In the dispatching procedure, users are
distributed to a server by one of three alternatives:

- If a regular user logs on, send the user to the recommended server and
  return to listening mode.

- If an infrequent user logs on with a transaction, find the predicted regular
  transactions implied by the transaction. If no entry matches, the single
  transaction is treated as the predicted regular transaction.

- Compute the potential new AR of each server after adding the user with the
  predicted regular transactions. Assign the user to the server with the
  highest AR, and update the AR of that server.

The distribution in the running example has ARs of 100%, 11/12, and 100% in
the three servers. If a new user with user-id 10 wishes to log on to the
system and submits A as the first transaction, then the user has an assumed
predicted regular transaction set of {A, B, E}. The ARs after adding
{A, B, E} to the three servers would be 18/20, 14/24, and 12/24. Therefore,
the new user is distributed to the first server, and the distribution becomes
{1, 2, 3, 10}, {4, 5, 6}, {7, 8, 9}.
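The dispatching decision for the new user can be reproduced with a small Python sketch. It covers only the last alternative of the procedure (computing the potential AR of each server and picking the highest); the function names and the in-memory representation of servers are assumptions made for illustration.

```python
from fractions import Fraction

# Regular transactions from Table 1 and the distribution of the running example.
REGULAR = {
    1: set("ABEFH"), 2: set("ABEFH"), 3: set("ABEFH"),
    4: set("IJK"), 5: set("BIJK"), 6: set("BIJK"),
    7: set("PQR"), 8: set("PQR"), 9: set("PQR"),
}
SERVERS = [{1, 2, 3}, {4, 5, 6}, {7, 8, 9}]


def cluster_ar(cluster, regular):
    """Average, over the transactions used in the cluster, of the share of users using each."""
    transactions = set().union(*(regular[u] for u in cluster))
    total = sum(
        Fraction(sum(1 for u in cluster if t in regular[u]), len(cluster))
        for t in transactions
    )
    return total / len(transactions)


def dispatch(user_id, predicted, servers, regular):
    """Return (index of the chosen server, AR of every server if the user joined it)."""
    trial = dict(regular)
    trial[user_id] = set(predicted)
    ars = [cluster_ar(members | {user_id}, trial) for members in servers]
    best = max(range(len(servers)), key=lambda i: ars[i])
    return best, ars


best, ars = dispatch(10, {"A", "B", "E"}, SERVERS, REGULAR)
SERVERS[best].add(10)  # the chosen server now hosts the new user
print(best, ars)
```

The three potential ARs come out as 9/10, 7/12, and 1/2, which are the 18/20, 14/24, and 12/24 of the example in lowest terms, so the first server is chosen.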



6 Related Work 

With the Internet rush, much research has been devoted to distributing user
requests with Distributed Web Server Architectures to improve the performance
of web servers. Depending on where request distribution happens, these
approaches are classified as client-based, DNS (Domain Name Server)-based,
dispatcher-based, and server-based by [5, 4, 14, 15]. Since the current HTTP
protocol is stateless, each request is routed independently to a web server
[4]. All of the above research assumes that requests can be routed
independently to different servers, whereas in the application servers of ESs,
requests from the same user have to be routed to the same server.







P.-Y. Hsu and P.-H. Ting 



The clustering literature is classified into partitioning clustering and
hierarchical clustering [12, 6, 9]. If k clusters are needed, partitioning
clustering chooses k centroids initially and gradually tunes the constituents
of each cluster or the centroids with some criterion function until a locally
optimal characteristic is met. Hierarchical clustering can be further divided
into agglomerative and divisive clustering. As the name suggests,
agglomerative clustering gradually merges smaller clusters into larger
clusters until k clusters are found. Divisive clustering, on the other hand,
splits larger clusters into smaller clusters until k clusters are found. POCA
is closer to agglomerative clustering, although it does not have a predefined
number of clusters.

Most clustering algorithms employ Euclidean distances to compute similarity.
The shorter the distances, the more similar the data points in a cluster are.
However, Euclidean distances are not ideal for clustering categorical data.
For example, to cluster transaction sets with Euclidean distances, each set
has to be translated into a sparse binary vector. Many set-oriented algorithms
use the Jaccard coefficient [12] or ROCK [7]. However, the Jaccard coefficient
and ROCK alone cannot describe the number of elements in each cluster, which
is important for calculating the buffer efficiency.

7 Conclusion 

Managers in enterprises often add users to ESs as they extend E-business
practices to various parts of corporate operations. Each additional user puts
new pressure on system performance. Yet system response time is one of the
most important factors in measuring user satisfaction.

Since ESs tend to consume a great deal of memory, application servers can
easily use up all the memory allowed by hardware constraints. When this
happens, the next step commonly adopted to boost performance is adding
application servers to ESs. With multiple application servers on the scene,
distributing users with similar application requirements to the same
application servers increases buffer utilization and extends the lead time
to the next hardware upgrade. POCA suggests such distributions with the
fewest clusters. Along with the suggestions, the Application Reusability of
each server is provided for the reference of system administrators.

Several issues require further study, such as modeling user profiles with
sequences, dynamically updating user patterns, incorporating CPU and system
loads into dispatching and distribution algorithms, and improving the
efficiency of POCA.

Abstract. Due to their inherent bandwidth burstiness, variable-bit-rate (VBR)
encoded videos are very difficult to transmit effectively over clustered video
servers while achieving high server bandwidth utilization and efficiency and
guaranteeing QoS. Previous bandwidth smoothing schemes tend to cause long
initial delay, data loss and playback jitter. In this paper, we propose a
content-based bandwidth smoothing scheme, called CBBS, which first splits
video objects into many small segments based on the complexity of picture
content in different visual scenes, so that each segment contains exactly one
visual scene, and then allocates a constant bit rate to transfer each segment.
Performance evaluation based on real-life MPEG-IV traces shows that the CBBS
scheme can significantly improve the server bandwidth utilization and
efficiency, while the initial delay and client buffer occupancy are also
significantly reduced.



1 Introduction 

Due to their high scalability and low cost, clustered video servers [6] have
become indispensable for providing the large capacity needed to serve
thousands of concurrent clients. Such a cluster comprises two parts: one RTSP
server node and several RTP server nodes. The RTSP server is responsible for
exchanging control messages with clients, while the RTP servers are
responsible for transferring video data to clients. Video objects are often
divided into many fixed-length segments that are uniformly distributed over
the RTP server nodes.

Usually, in order to guarantee QoS, for each stream a bandwidth b equal to the
peak bit rate of the requested video object must be reserved on the
corresponding RTP server nodes. Nevertheless, for VBR encoded video objects,
the peak bit rate is far larger than the mean bit rate [8]. This means that
most of the reserved bandwidth is unused most of the time, which leads to low
server bandwidth utilization, so better solutions are necessary.

Previous work has proposed many schemes to improve network bandwidth
utilization. Among them, the constant rate transmission and transporting (CRTT)
scheme [3] employs constant bit rate (CBR) transmission of VBR video objects.
It works by calculating the minimum bandwidth that prevents client buffer
underflow. Further, considering both the bandwidth and the client buffer size,
it determines the amount of data to be transmitted to the client in advance.
Other schemes, such as e-PCRTT [2], MCBA [1], DBA [8] and MVBA [4] [5], first
prefetch video data until half of the client buffer is filled, and then
dynamically select the transmission bit rate within the "river" bounded by the
maximum transmission rate that guarantees no buffer overflow and the minimum
transmission rate that guarantees no buffer underflow. Since the transmission
bandwidth changes dynamically several times, the peak bit rate is smoothed
somewhat. However, for popular client buffer configurations, prefetching video
data until half of the client buffer is filled tends to result in a long
initial delay or a large initial bandwidth requirement.

* This paper is supported by National 863 Hi-Tech R&D Project under grant
NO.2002AA1Z2102.

H. Jin et al. (Eds.): NPC 2004, LNCS 3222, pp. 238-243, 2004.
© IFIP International Federation for Information Processing 2004

CBBS: A Content-Based Bandwidth Smoothing Scheme for Clustered Video Servers

In this paper, we focus on pre-recorded VBR video objects and propose a
content-based bandwidth smoothing scheme, called CBBS, which can significantly
improve the server bandwidth utilization and efficiency while guaranteeing
QoS. The rest of the paper is organized as follows. In section 2, we describe
the content-based bandwidth smoothing scheme. Section 3 evaluates the
performance of CBBS via real-life MPEG-IV traces. Finally, section 4 ends with
conclusions and future work.



2 Content-Based Bandwidth Smoothing Scheme 

The bit rate variation of VBR encoded videos results from two sources. One is
the variation in picture content between different visual scenes, called the
inter-scene variable bit rate. It causes the sizes of different I-frames to
fluctuate. The other is the variation in frame size within a visual scene,
called the intra-scene variable bit rate. Usually, in the same visual scene
with the same picture content, I-frames are larger than P-frames, and B-frames
have the smallest frame size. Table 1 shows the quantitative analysis for
different kinds of variable bit rate based on real-life MPEG-IV video traces¹.
Since I-frames encode the basic content of visual scenes, in Table 1 we assume
that one visual scene comprises one group of pictures (GOP) and use the bit
rate variation among I-frames to represent the inter-scene variable bit rate.
As one can see, the bit rate variation caused by the inter-scene is far larger
than that caused by the intra-scene.

Table 1. The quantitative analysis for different kinds of variable bit rate.

Movie                  Length     Std. dev. of inter-scene   Std. dev. of intra-scene
                       (frames)   VBR (Kb/s)                 VBR (Kb/s)
Silence of the Lambs   89998      510.001                    177.665
Mr. Bean               89998      364.201                    243.740
Star Wars IV           89998      192.513                    133.620
Jurassic Park          89998      490.650                    241.544
Aladdin                89998      361.410                    207.373
Robin Hood             89998      390.707                    290.764
Sports-Soccer          89998      421.061                    230.432
Sports-Formula-1       30334      373.620                    223.432
News-ARD News          22498      375.941                    309.414
News-ARD Talk          89998      339.640                    235.097

¹ The traces can be obtained from the web site:
http://www-tkn.ee.tu-berlin.de/~fitzek/TRACE/pics.

Based on the above analysis, allocating a constant bit rate (CBR) for
transferring each scene would not result in large client buffer occupancy or a
long initial delay, since the intra-scene VBR variation is relatively small.
Thus, we can use a splitting scheme to divide video objects according to the
picture content of visual scenes and allocate a constant bandwidth for
transferring each segment. After bandwidth allocation, the maximum client
buffer requirement that guarantees no buffer overflow and the segment
information, such as the start playback time, the allocated bandwidth, and the
IP address of the storing RTP node, are available and can be maintained on the
RTSP server node. Whenever the RTSP server considers admitting a request, for
each segment of the requested video object it only needs to ask the RTP nodes
whether they have enough bandwidth to transfer the segment in the
corresponding time interval. If so, it notifies the RTP server nodes to
reserve the corresponding constant bandwidth. Since the reserved bandwidth is
the allocated constant bandwidth rather than the peak bit rate, the server
bandwidth utilization is improved significantly.
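The admission check described above might look roughly like the following Python sketch. Everything here is hypothetical: the `RtpNode` class, the `admit` function, the use of discrete time slots, and the bandwidth figures are illustration only and are not taken from the Turbogrid implementation.

```python
from collections import defaultdict


class RtpNode:
    """Hypothetical RTP node that tracks reserved bandwidth per time slot."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.reserved = defaultdict(float)  # time slot -> bandwidth already reserved

    def can_reserve(self, start, end, bandwidth):
        return all(self.reserved[t] + bandwidth <= self.capacity
                   for t in range(start, end))

    def reserve(self, start, end, bandwidth):
        for t in range(start, end):
            self.reserved[t] += bandwidth


def admit(segments, nodes, now):
    """Admit a request only if every segment fits on its node in its time interval.

    segments: list of (node_id, start_offset, end_offset, allocated_bandwidth),
    with offsets relative to the start of playback; segments of one stream are
    assumed not to overlap in time.
    """
    plan = [(nodes[n], now + s, now + e, b) for n, s, e, b in segments]
    if not all(node.can_reserve(s, e, b) for node, s, e, b in plan):
        return False
    for node, s, e, b in plan:
        node.reserve(s, e, b)
    return True


nodes = {0: RtpNode(100.0), 1: RtpNode(100.0)}
video = [(0, 0, 10, 60.0), (1, 10, 20, 80.0)]  # two segments on two nodes
first = admit(video, nodes, now=0)    # fits on both nodes
second = admit(video, nodes, now=0)   # node 0 would need 120 in slots 0..9
third = admit(video, nodes, now=20)   # same video, later interval
print(first, second, third)
```

Because each segment reserves only its own constant bandwidth for its own interval, a later request for the same video can be admitted as soon as the intervals no longer collide.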

Let d and p (p > 0) be the initial delay and the threshold for splitting video
objects, respectively. The sequence {X_1, X_2, ..., X_N} represents the frame
sizes of the requested video object, where X_i is the size of the i-th frame.
We define S to be the size of the first I-frame in the segment starting at
time point t_s and ending at time point t_e, and define b_min and B_max to be
the minimum bandwidth at which the clustered video servers may transmit and
the maximum buffer occupancy at the client side over a given interval
[t_s, t_e], respectively, where the client buffer must never underflow and
transmission starts from the initial buffer level Q.

The video splitting procedure keeps the frame being processed in the current
segment if it is not an I-frame, or if it is an I-frame whose size X_k
satisfies the following inequality; otherwise it starts a new segment.

\[
|X_k - S| < p \times S \qquad (1)
\]

For the first segment, the allocated bandwidth b_1 is set to the mean bit rate
of the first segment, i.e.

\[
b_1 = \frac{1}{t_e} \sum_{k=1}^{t_e} X_k \qquad (2)
\]

To guarantee no buffer underflow while the first segment is played back, at
the time point of each frame of the first segment the amount of data sent must
be larger than or equal to the amount of data consumed. Thus, we obtain

\[
d = \max\left\{ \frac{\sum_{k=1}^{t} X_k}{b_1} - t \right\}, \quad t \in \{1, 2, \ldots, t_e\} \qquad (3)
\]
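A minimal Python sketch of the first-segment calculation follows. It assumes that the bandwidth is the mean frame size per frame period and that the initial delay is the smallest head start, in frame periods, for which the data sent by each frame time covers the data consumed; the function name and the sample sizes are hypothetical.

```python
def first_segment_allocation(sizes):
    """Return (b_1, d) for the first segment.

    sizes: frame sizes of the first segment, one frame consumed per frame period.
    b_1 is the mean size per frame period; d is the initial delay in frame periods.
    """
    b1 = sum(sizes) / len(sizes)
    consumed = 0
    d = 0.0
    for t, size in enumerate(sizes, start=1):
        consumed += size
        # Data sent by time t is b1 * (t + d); it must cover what has been consumed.
        d = max(d, consumed / b1 - t)
    return b1, d


SIZES = [100, 40, 20, 110, 45]  # made-up frame sizes
b1, d = first_segment_allocation(SIZES)
print(b1, d)
```

For these sizes the large first frame dominates: the mean is 63 per frame period, so a bit more than half a frame period of prefetching is needed before playback can start.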

For the other segments, we use the following equation to calculate the
allocated bandwidth.

\[
b_{min} = \max\left\{ \frac{\sum_{j=1}^{t} X_j - \left(\sum_{k=1}^{t_s} X_k + Q\right)}{t - t_s} \right\}, \quad t \in \{t_s + 1, t_s + 2, \ldots, t_e\} \qquad (4)
\]

Once the bandwidth b_min has been allocated, the maximum client buffer
occupancy while the segment being processed is played back and the initial
buffer level Q for transferring the next segment can be derived as follows,
where Q on the right-hand side is the buffer level at t_s.

\[
B_{max} = \max\left\{ (t - t_s) \times b_{min} + Q - \sum_{k=t_s+1}^{t} X_k \right\}, \quad t \in \{t_s + 1, t_s + 2, \ldots, t_e\} \qquad (5)
\]

\[
Q = b_{min} \times (t_e - t_s) + Q - \sum_{k=t_s+1}^{t_e} X_k \qquad (6)
\]
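The per-segment calculation can be sketched in Python under one reading of equations (4) to (6): the segment consumes one frame per frame period after t_s, starts from buffer level Q, and the buffer level at t_e becomes the Q of the next segment. The function name, the clamping of the bandwidth at zero, and the sample sizes are assumptions for illustration.

```python
def allocate_segment(sizes, q):
    """Return (b_min, B_max, next Q) for one segment.

    sizes: sizes of the frames consumed at t_s+1 .. t_e, one per frame period.
    q: buffer level at t_s, carried over from the previous segment.
    """
    b_min = 0.0  # assumption: never allocate a negative bandwidth
    consumed = 0
    for elapsed, size in enumerate(sizes, start=1):
        consumed += size
        # Smallest constant rate that avoids underflow at this frame, as in (4).
        b_min = max(b_min, (consumed - q) / elapsed)

    b_max = 0.0
    consumed = 0
    level = q
    for elapsed, size in enumerate(sizes, start=1):
        consumed += size
        level = elapsed * b_min + q - consumed  # buffer level after this frame
        b_max = max(b_max, level)               # peak occupancy, as in (5)
    return b_min, b_max, level                  # final level is the next Q, as in (6)


# Two made-up segments processed back to back, starting from an empty buffer.
q = 0.0
results = []
for segment in ([100, 40, 20, 110, 45], [200, 30, 190, 50]):
    b_min, b_max, q = allocate_segment(segment, q)
    results.append((b_min, b_max, q))
print(results)
```

In the second segment the 185 units left over from the first segment are counted against the demand, so a rate of about 78 per frame period suffices even though its first frame alone has size 200.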



PROCEDURE FOR VIDEO SPLITTING AND BANDWIDTH ALLOCATION

INPUT: Video frame sequence {X_1, X_2, ..., X_N}, where N is the number of
       frames in the input video object.
OUTPUT: Initial delay, maximum client buffer occupancy and video segments with
        allocated bandwidth.

S=X_1, Q=0, ts=1, te=1, b_1=0, d=0, b_min=0, B=0;  // X_1 is the first I-frame in the frame sequence.
FOR (k=1; k<=N; k++) {
  IF (S==X_1) {  // processing the first segment.
    IF ((X_k is not an I-frame) || (X_k is an I-frame) && (|X_k-S| < pS)) {
      b_1 += X_k;
    }
    ELSE {
      b_1 = b_1/k, te=k;
      calculate d using equation (3);
      ts = -d;
      calculate Q and B_max using equation (6) and (5), respectively;
      IF (B < B_max) B = B_max;
      output the first segment starting from ts and ending at te, with
        allocated bandwidth b_1 and initial delay d;
      ts=k, S=X_k;
    }
  }
  ELSE {  // processing other segments.
    IF ((X_k is not an I-frame) || (X_k is an I-frame) && (|X_k-S| < pS)) {
      calculate b_min using equation (4);
    }
    ELSE {
      te=k;
      calculate Q and B_max using equation (6) and (5), respectively;
      IF (B < B_max) B = B_max;
      output a video segment starting from ts and ending at te, with
        allocated bandwidth b_min;
      ts=k, S=X_k;
    }
  }
}
output the maximum client buffer occupancy B.

Fig. 1. Pseudo-code of video splitting and bandwidth allocation algorithm.



The video splitting and bandwidth allocation algorithm is presented formally
in Fig. 1. The notations used in this figure have been defined above.

Fig. 2. Comparison of (a) the average initial delay, (b) the average buffer
occupancy, and (c) the maximum concurrent streams among e-PCRTT, CRTT, and
CBBS with splitting threshold p = 0.4.

3 Performance Evaluation 

For illustrative purposes, we evaluate the performance of the CBBS scheme by
experiment and compare it with that of the e-PCRTT and CRTT schemes. In the
experiment, the clustered video servers are a prototype of the Turbogrid
streaming servers² with one RTSP server node and 8 RTP server nodes. Each node
uses a 1.4GHz CPU and a 100Mb/s NIC. Real-life MPEG-IV traces with different
contents, including Movies, Sports, News, Talk show, several Episodes, and
Cartoon, are split into many small segments by the proposed video splitting
scheme with threshold p = 40%. The length of each video trace is 89,998
frames. All video traces are played back at a frame rate of F = 25 frames/s.

Three kinds of popular clients are used in our experiment: the multimedia PDA
with a buffer capacity of 8 Mbytes, the set-top box with a buffer capacity of
32 Mbytes, and the PC with a buffer capacity of 64 Mbytes. Client requests are
generated by a Poisson arrival process with mean inter-arrival time 1/λ. The
arrival rate λ is varied from 200 to 1200 per hour. Once generated, a client
selects a video object and sends the request to the clustered video servers.
If the request is admitted, the client simply plays back the received stream
until the transmission is completed.
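A request workload of this kind can be generated with exponentially distributed inter-arrival times, as in the following Python sketch. The function name, the seed, and the one-hour duration are assumptions for illustration; the paper does not describe its workload generator in more detail.

```python
import random


def poisson_arrivals(rate_per_hour, duration_hours, seed=0):
    """Arrival times (in hours) of a Poisson process with the given rate.

    Inter-arrival times are drawn from an exponential distribution with
    mean 1 / rate_per_hour.
    """
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    while True:
        t += rng.expovariate(rate_per_hour)
        if t >= duration_hours:
            return arrivals
        arrivals.append(t)


# One hour of requests at 600 arrivals per hour, within the range used above.
arrivals = poisson_arrivals(600, 1.0, seed=1)
print(len(arrivals), "requests, first at", round(arrivals[0] * 3600, 1), "s")
```

Fixing the seed makes a run repeatable, which is convenient when the same request sequence has to be replayed against several schemes.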

Fig. 2 plots the performance comparison among the e-PCRTT, CRTT and CBBS
schemes with splitting threshold p = 0.4, where the inner figure of part (a)
magnifies the initial delay for CBBS with p = 0.4. From this figure, we can
easily see that the CBBS scheme significantly outperforms the other two
schemes. For example, the average initial delay and the average maximum buffer
requirement of the CBBS scheme are less than 1 second and 2MB, respectively,
whereas the e-PCRTT scheme needs approximately 120 seconds and 8MB, and the
CRTT scheme approximately 160 seconds and 16MB. As for bandwidth utilization,
which is indicated by the maximum number of concurrent streams supported by
the clustered video servers, the CBBS scheme can support 1044 concurrent
streams, while the e-PCRTT and CRTT schemes can support only approximately 600
concurrent streams.

² Turbogrid streaming servers are developed by Cluster and Grid Computing Lab of Huazhong
University of Science and Technology.

4 Conclusions and Future Works

In this paper, we propose a content-based bandwidth smoothing scheme, called CBBS,
which can significantly improve server bandwidth utilization and efficiency while
guaranteeing QoS. Unlike previous schemes, the CBBS scheme first splits video objects into
small segments based on the complexity of picture content in different scenes. It then
transfers each segment at a constant bit rate so that the intra-scene variable bit
rate can be effectively smoothed. When admitting a request, the CBBS scheme accurately
judges whether the remaining bandwidth of the RTF nodes is enough to transmit the stored
segments in the corresponding time interval. This significantly reduces the effect of the
inter-scene variable bit rate on the server bandwidth utilization.
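The admission test summarized above can be illustrated with a small sketch. It is not the
CBBS implementation: the slotted time axis, the function name `admit` and the
`(duration, rate)` segment format are assumptions chosen for the example. The idea shown
is only that each segment is sent at its own constant rate, and a request is admitted when
the bandwidth still unreserved in every affected time slot can carry it.

```python
def admit(load, capacity, segments, start_slot):
    """Try to admit one request.

    load:      bandwidth already reserved in each time slot (updated on success)
    capacity:  total bandwidth of the node, in the same unit as the rates
    segments:  list of (duration_in_slots, constant_rate) for the requested video
    """
    needed = []
    slot = start_slot
    for duration, rate in segments:
        needed.extend((slot + k, rate) for k in range(duration))
        slot += duration
    if slot > len(load):
        return False                   # request runs past the planning horizon
    if any(load[s] + r > capacity for s, r in needed):
        return False                   # some interval lacks remaining bandwidth
    for s, r in needed:
        load[s] += r                   # reserve bandwidth segment by segment
    return True


if __name__ == "__main__":
    load = [0] * 10
    video = [(2, 6), (3, 3)]           # a complex scene followed by a simple one
    print(admit(load, 10, video, 0), admit(load, 10, video, 0), admit(load, 10, video, 2))
```

In this toy run the second request is rejected because its high-rate segment would overlap
the first one, while the same request shifted by two slots is admitted.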

Ongoing research includes:

1. Evaluating the effect of the splitting threshold p on the performance of clustered
video servers and deriving the optimal p based on statistical analysis of a large number
of real-life video traces;

2. Developing optimal disk retrieval models and strategies to work with the scene-based
video striping scheme;

3. Designing a time-scaled resource reservation protocol to reduce the impact of traffic
burstiness and improve network utilization over the Internet.

Abstract. Networked storage has become an increasingly common and
essential component of clusters. In this environment, the network storage
software through which client nodes directly access remote network-attached
storage is a critical requisite. Many implementations of this function exist,
such as iSCSI. However, they are not tailored to the large-scale cluster
environment and cannot satisfy its high efficiency and scalability
requirements. In this paper, we present a more efficient technology for
network storage in clusters and evaluate it in detail through our
implementation, SuperNBD. The results indicate that SuperNBD is more
efficient, more scalable, and a better fit for the cluster environment.



1. Introduction 

With the steady increase in the volume of data produced by scientific applications and
in their I/O rate requirements, networked storage has become a common and essential
component of the high-performance cluster environment. It permits hosts to easily access
remote data through the network.

Most storage area networks (SANs) used to adopt Fiber Channel [1] as their private
storage network; however, because the hardware is expensive, most small and
medium-sized enterprises cannot afford it. Recently, with the advent of Gigabit (or
even 10 Gigabit) Ethernet, IP-based storage networks (i.e. IP-SAN), which can
leverage the existing LAN environment, are becoming more popular.

The network storage software that lets hosts transparently access data blocks on
storage devices is the critical layer of network storage. iSCSI is a network
storage protocol currently used for IP-SAN. It encapsulates and transports SCSI
commands and data across the network [3], dividing the large buffer that comes from the
buffer cache of the OS kernel into packets before sending them to the server. Every
buffer transmission therefore involves several individual messages carrying SCSI
commands and sub-buffers of the data block [3]. In addition, to cope with security
problems when transmitting across a wide area network, it introduces some mechanisms
that are unnecessary in a closed and reliable network environment such as a cluster;
these mechanisms induce more overhead and decrease performance.

This work was supported by the National High-Technology Research and Development
Program of China (863 Program) under grants No. 2002AA1Z2102 and
No. 2002AA104410.

We present SuperNBD, a new compact and efficient network storage implementation
technology for the cluster environment. Practical experiments show that it is
efficient and scalable.

The rest of this paper is organized as follows. Section 2 discusses the design issues
and implementation of SuperNBD in detail. Section 3 analyzes its performance and
scalability and compares these metrics with other implementations. Section 4
concludes the paper.



2. Design Issues and Implementation 



SuperNBD is composed of two parts with different functions: the SuperNBD client and
the SuperNBD server; both reside at the buffer cache layer of the kernel, as shown in
Figure 1. The SuperNBD client receives the data operation requests issued by a higher
layer of the kernel, such as VFS, and forwards them directly to the SuperNBD server.
We have introduced the following specialized mechanisms to improve its efficiency
and scalability.







Fig. 1. The overview architecture of SuperNBD 

To increase the total data throughput of the storage system, several service
threads specializing in data processing are introduced on both sides, so that the data
of a file distributed across multiple devices can be read and written concurrently.

Since all the blocks within a single request map to the same device in sequence,
only one simple message, carrying the sequence number of the first block and the
total number of blocks, needs to be sent to the server before data transmission.
This is fewer messages than iSCSI requires.
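To make the single-message idea concrete, here is a minimal sketch of such a request
header. The wire layout is purely hypothetical (a 1-byte opcode, an 8-byte first block
number and a 4-byte block count in network byte order); the actual SuperNBD message
format is not given in the paper, and the names `HEADER`, `pack_request` and
`unpack_request` are invented for the example.

```python
import struct

# Hypothetical layout, NOT the real SuperNBD format:
# opcode (1 byte), first block number (8 bytes), block count (4 bytes), big-endian.
HEADER = struct.Struct("!BQI")
READ, WRITE = 0, 1


def pack_request(op, first_block, count):
    """Describe a run of consecutive blocks with one fixed-size message."""
    return HEADER.pack(op, first_block, count)


def unpack_request(data):
    """Recover the opcode and the full list of block numbers on the server side."""
    op, first_block, count = HEADER.unpack(data)
    return op, list(range(first_block, first_block + count))


if __name__ == "__main__":
    msg = pack_request(READ, 4096, 256)
    print(len(msg), unpack_request(msg)[1][:3])
```

Because the blocks are consecutive on one device, the first block number and the count are
enough for the server to enumerate every block in the request.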

Along the whole I/O path, all requested data blocks are transferred directly from the
client data block cache to that of the server and vice versa, eliminating any need for
memory copies within SuperNBD.

To increase the write bandwidth for highly data-intensive applications, we adopt an
asynchronous write mechanism. Because each write operation in block I/O storage
completely overwrites the block content, it is unnecessary to read the corresponding
data block from disk beforehand; the server only allocates a block cache from main
memory and stores the data directly into it. From the point of view of the client that
issued the request, the whole operation therefore completes promptly. The kernel flushes
all dirty data blocks to disk when free memory reaches a certain low watermark, so that
client-side writing and server-side disk writing proceed in parallel, which greatly
improves the total write throughput of the storage I/O path.

Things are different for data reading. Most often, when clients issue a read request,
the file data resides only on the physical storage device. The large cache that greatly
contributes to high write bandwidth has little advantage here, and the read can only be
handled synchronously. To increase the total read performance of SuperNBD, we present
an adaptive data block prefetching mechanism that exploits the locality of data reading
on the server end. For example, assume A and B are two requests that reach the server
side in order, possibly issued by different clients, and both relate to the same device.
Request A reads blocks with sequence numbers (block numbers) from $a_0$ to $a_n$, and B
from $b_0$ to $b_m$. If the value of $|b_0 - a_n|$ is within a certain reasonable
interval, the two requests can be considered a sequential operation, and when the server
finishes request B, it keeps reading the next several blocks, so that the next read
operation will most likely hit the cache.
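The sequential-access test described above can be sketched as follows. The class name
`ReadAhead`, the gap threshold `max_gap` and the fixed `window` are assumptions for
illustration; the paper calls the mechanism adaptive but does not say how the interval or
the prefetch depth are chosen, so this sketch keeps both constant.

```python
class ReadAhead:
    """Per-device sequential detection with a fixed read-ahead window (a sketch)."""

    def __init__(self, max_gap=8, window=4):
        self.max_gap = max_gap     # largest |b0 - an| still treated as sequential
        self.window = window       # number of extra blocks to read ahead
        self.last_end = {}         # device -> last block of the previous request

    def blocks_to_read(self, device, first, last):
        """Return the blocks the server should fetch for request [first, last]."""
        prev = self.last_end.get(device)
        sequential = prev is not None and abs(first - prev) <= self.max_gap
        self.last_end[device] = last          # remember a_n for the next request
        extra = self.window if sequential else 0
        return list(range(first, last + 1 + extra))


if __name__ == "__main__":
    ra = ReadAhead(max_gap=2, window=3)
    print(ra.blocks_to_read("sdb", 0, 3))     # first request: no history yet
    print(ra.blocks_to_read("sdb", 4, 5))     # close to the previous end: prefetch
```

The history is kept per device rather than per client, matching the observation that the
two requests may come from different clients.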

A pipeline mechanism has been introduced that parallelizes device reading with data
transmission, the two major time-consuming operations during the reading process.
In SuperNBD, when some of the blocks are ready (hit in the buffer cache or just read
from the device) and others are still being handled by the local kernel, the ready data
is transferred to the clients first. By the time the previously ready data has been
sent, most of the other blocks are ready in memory. In this way, the pipeline can
greatly reduce the response time of read requests and fully exploit the potential
performance of the hardware resources.

In addition, the SuperNBD server’s integration with the local buffer cache also
benefits shared operations, i.e. multiple clients simultaneously accessing the same
data blocks. In this situation, only one physical device operation is needed for
each block, so most requests can be serviced directly from the server buffer cache.
This enables SuperNBD to scale well to a large number of clients.



3. Performance Comparison and Analysis 

In this section, we evaluate the efficiency and scalability of SuperNBD and present a
performance comparison with unh-iscsi [3]. There are two sets of environments for our
experiments:

(1) Set1. Consists of 33 nodes, each with four AMD Opteron(tm) processors (2.2GHz),
8GB memory, and SuSE Linux 8.0 with kernel 2.4.19 SMP. All nodes are connected by
Gigabit Ethernet. This environment is used to test the efficiency and scalability of
SuperNBD.

(2) Set2. Consists of 9 nodes, each with dual 2.4GHz Intel Xeon processors,
1GB memory, and Red Hat 7.2 with kernel 2.4.18-3smp. All nodes are connected by
100Mbit Ethernet. This environment is used for the performance comparison between
SuperNBD and unh-iscsi, because unh-iscsi cannot support the x86_64 architecture.

3.1. Efficiency and Scalability Evaluation

In this test, we use one SuperNBD server and continually increase the number of
clients, each of which reads or writes 1GB of data with a 1MB record size. Figure 2
shows that SuperNBD is efficient and scalable and keeps its peak performance as the
scale increases. The asynchronous writing and adaptive read-ahead mechanisms
contribute greatly to this result.

Fig. 2. Efficiency and Scalability of SuperNBD

3.2. Performance Comparison with Other Implementations

Fig. 3. Performance comparison between SuperNBD and unh-iscsi



Figure 3 shows the performance comparison between SuperNBD and unh-iscsi.
SuperNBD outperforms unh-iscsi on both writing and reading, mainly due to its
compact protocol and the optimized implementation described in Section 2.



4. Conclusion and Future Work

In this paper, we present a compact but efficient technique for constructing network
storage for clusters and evaluate the performance of our implementation, SuperNBD.
As the results show, SuperNBD is efficient and scalable, and a better fit for
the cluster environment.

Abstract. A fault tolerant parallel virtual file system is designed and
implemented to provide high I/O performance and high reliability. A queuing model
is used to analyze in detail the average response time when multiple clients
access the system. The results show that the I/O response time is a function of
several operational parameters. It decreases with increases in the I/O buffer hit
rate for read requests, the write buffer size for write requests, and the number of
server nodes in the parallel file system, while a higher I/O request arrival rate
increases the I/O response time.



1 Introduction 

Parallel Virtual File System (PVFS) is a parallel file system for Linux clusters, 
which stripes the data among the cluster nodes and accesses these nodes in parallel to 
achieve high I/O throughputs. A Cost-Effective Fault-Tolerant Parallel Virtual File 
System (CEFT-PVFS) has been designed and implemented to meet the critical 
demands on reliability while still being able to deliver a considerably high throughput. 

When multiple clients submit data-intensive jobs at the same time, the response 
time experienced by the user is an indicator of the power of the cluster. In this paper, 
a queuing model is used to analyze in detail the average response time when multiple 
clients access the fault tolerant parallel virtual file system. 



2 Architecture of the Fault Tolerant Parallel Virtual File System

The diagram of the fault tolerant parallel virtual file system is shown in Fig. 1. All
I/O server nodes are divided into two groups, the primary one and the mirroring one.
File data is striped across the primary group and duplicated on the mirroring group.
When writing, data is stored in the primary group in RAID 0 style and backed up in the
mirroring group simultaneously. When reading, data is retrieved from whichever node of
a mirrored pair has the lighter workload, to optimize read performance.

Fig. 1. Diagram of the fault tolerant parallel virtual file system



3 Definitions and Notations of Disk I/O Parameters 

I/O response time in the fault tolerant parallel virtual file system depends primarily
on the network bandwidth, the I/O buffer size and policy, and the disk I/O service
time. The disk I/O may become the bottleneck; it is determined by three main
parameters, namely seek time, rotational latency and transfer rate.

Definitions and notations for several relevant disk I/O parameters are given below:

$C$: number of disk cylinders;

$S$: disk seek time, with its maximum being denoted by $S_{max}$;

$R$: disk rotational latency, with the full rotation time being denoted by $R_{max}$;

$D$: disk seek distance, which is a random variable in the range $[0, C-1]$;

$p_s$: the probability that the seek distance is 0, $p_s = P\{D = 0\}$;

$L$: data striping size;

$T_t$: data transfer time, $T_t = L / r_t$, where $r_t$ is the disk transfer rate;

$Y_r, Y_w, Y$: disk read, write and whole service time, respectively.

The relationship between the disk seek time and the seek distance $i$ is given as:

$$S = a + b\sqrt{i}, \quad i > 0$$

where $S$ is the seek time, $a$ is the arm acceleration time, and $b$ is the seek
factor of the track. $a$ is the seek time between two neighboring cylinders.

The mean seek time and its second moment can be accurately approximated as:

$$E(S) = (1 - p_s)\left[a + \tfrac{2}{3} b \sqrt{C}\right]$$

$$E(S^2) = (1 - p_s)\left[a^2 + \tfrac{1}{2} b^2 (C - 1) + \tfrac{4}{3} a b \sqrt{C - 1}\right]$$

When requests to a disk are independent of one another, the rotation time is
assumed to be uniformly distributed in $[0, R_{max}]$, with probability density function:

$$f_R(x) = \frac{1}{R_{max}}, \quad 0 \le x \le R_{max}$$

The mean rotation time and its second moment are:

$$E(R) = \tfrac{1}{2} R_{max}, \qquad E(R^2) = \tfrac{1}{3} R_{max}^2$$

The disk drive service time is $Y = S + R + T_t$.

M. Y. Kim studied traces of real disk service times and found them to be generally
distributed. Therefore, we adopt the M/G/1 model to analyze the disk response
time in the cluster.



4 I/O Response Time Analysis 




Fig. 2. The Queuing Model of I/O Service



The queuing model for the system service under data-intensive load is shown in
Fig. 2. Part of the main memory space in a server node is used as an I/O cache buffer
to hide the disk I/O latency and to take advantage of data reference locality. Assume
the number of I/O server nodes to be $N$ in each group. I/O requests follow a Poisson
process with a mean arrival rate of $\lambda$. The arrival rate to server node $i$ is
$\lambda_i = p_i \lambda$, where $p_i$ is the probability that the request is directed
to node $i$. When the I/O request is a small read or write, where the data size is equal
to or less than the size of a striped block, and the workload on the server nodes in a
group is balanced, $p_i$ is equal to $1/N$. When the I/O request is a large read or
write, where data is striped on all of the nodes in a group, $p_i$ is equal to 1. So the
typical range of $p_i$ is $[1/N, 1]$. Let $p_r$ and $p_w = 1 - p_r$ denote the read and
write probability of a request, respectively. Assume the I/O cache buffer hit rate for
read requests to be $h_r$ and the probability of the write buffer being full to be
$p_f$. Thus, the effective arrival rate to each disk is:

$$\lambda_{disk} = [p_r (1 - h_r) + p_w p_f]\, p_i \lambda$$

Assume that the network service time and the I/O buffer service time are exponentially
distributed with the average times $t_{net}$ and $t_c$, respectively. Therefore, the
request residence times in the network, $W_{net}$, and in the I/O buffer, $W_{cache}$,
can be modeled using the M/M/1 queuing model. The average residence time in a disk
drive, $W_{disk}$, can be calculated according to the M/G/1 model.

The average I/O response time can be expressed as:

$$T = W_{net} + W_{cache} + P_{disk} W_{disk}$$

$$T = \frac{t_{net}}{1 - p_i \lambda t_{net}} + \frac{t_c}{1 - p_i \lambda t_c} + [p_r (1 - h_r) + p_w p_f] \left[ \frac{\lambda_{disk} E(Y^2)}{2 (1 - \lambda_{disk} E(Y))} + E(Y) \right]$$

where $t_{net} = L / R_{net}$, $t_c = L / R_{memory}$, and $L$ is the size of data to be
accessed on each I/O server node. Even though the data is striped in fixed blocks to
server nodes in a RAID 0 style, the blocks can be combined into one large block with
length equal to $L$. $R_{net}$ and $R_{memory}$ are the available network bandwidth and
the available memory access rate, respectively.

The average I/O response time can be obtained from the above formula for $T$ under
different workloads and application environments with different parameters.
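As a numerical illustration of the model, the sketch below evaluates a response time of
the form described in this section: two M/M/1 residence times (network and I/O buffer)
plus an M/G/1 disk residence time weighted by the probability that a request reaches the
disk. The function and parameter names are assumptions made for the example, and the disk
service moments are passed in directly rather than derived from the seek and rotation
model.

```python
def mm1_residence(arrival_rate, service_time):
    """Mean residence time (waiting plus service) in an M/M/1 queue."""
    rho = arrival_rate * service_time
    if rho >= 1:
        raise ValueError("queue is unstable (utilization >= 1)")
    return service_time / (1 - rho)


def mg1_residence(arrival_rate, mean_service, second_moment):
    """Mean residence time in an M/G/1 queue (Pollaczek-Khinchine formula)."""
    rho = arrival_rate * mean_service
    if rho >= 1:
        raise ValueError("queue is unstable (utilization >= 1)")
    return arrival_rate * second_moment / (2 * (1 - rho)) + mean_service


def io_response_time(lam, p_i, t_net, t_cache, p_disk, e_y, e_y2):
    """Network + cache + (probability of a disk access) * disk residence time."""
    lam_i = p_i * lam                       # arrival rate seen by one server node
    return (mm1_residence(lam_i, t_net)
            + mm1_residence(lam_i, t_cache)
            + p_disk * mg1_residence(p_disk * lam_i, e_y, e_y2))


if __name__ == "__main__":
    # Illustrative values only (seconds and requests per second), not measured data.
    print(io_response_time(lam=50, p_i=0.25, t_net=0.002, t_cache=0.0005,
                           p_disk=0.3, e_y=0.012, e_y2=0.0002))
```

A useful sanity check is that an M/G/1 queue with exponential service (second moment equal
to twice the squared mean) gives the same residence time as the M/M/1 formula.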



5 Conclusion 

The I/O response time in the fault tolerant parallel virtual file system is discussed
in this paper. The analytical results show the different levels of sensitivity of the
average I/O response time to various system and operational parameters such as I/O
buffer size, data locality, read/write probability, request data size and server group
size. These results provide useful insight into the complicated relationships among
different system and operational parameters, thus allowing potential optimization of
system configuration and I/O performance.

Abstract. In this paper, we describe the design and implementation of the
SARCNFS file system for network-attached clustered storage systems.
SARCNFS stripes data and metadata among multiple NAS nodes and provides a
file redundancy scheme and a synchronization mechanism for distributed
RAID5. SARCNFS uses a self-adaptive redundancy scheme for file data accesses
that uses RAID5-level redundancy for large writes and RAID1-level redundancy
for small writes, switching dynamically between RAID1 and RAID5 to
provide the best performance. In addition, SARCNFS provides a simple
distributed locking mechanism that uses RAID5-level redundancy for full stripe
writes and RAID1-level redundancy for temporarily storing data from partial
stripe updates. As a result, low response latency, high performance and strong
reliability are achieved.



1 Introduction 

The distributed RAID concept was proposed by Stonebraker and Schloss [1]. Examples
of early distributed RAID systems include Swift/RAID [2], Petal [3] and Tertiary
Disk [4]. In a distributed RAID5 implementation, there are two significant issues:

1. The latency of small write accesses is high because of the extra reads.

2. Simultaneous writes in a cluster environment raise a synchronization problem.

In this paper, we design and implement a file system for the clustered NAS storage
environment, called SARCNFS, that distributes user data and metadata among multiple
NAS nodes, similar to xFS [5], [6], Zebra [7], Swarm [8], Frangipani [9], and
PVFS [10]. We employ three redundancy schemes in the SARCNFS file system. The first
is a striped, file-mirroring scheme like RAID1, in which the user data is stored
twice by the file system. The second is a RAID5-like scheme, which uses parity-based
partial redundancy. Finally, we adopt a self-adaptive scheme that uses RAID5-level
redundancy for large writes and RAID1-level redundancy for small writes, switching
dynamically between RAID1 and RAID5 to minimize performance degradation. In addition,
we propose a simple locking mechanism that uses RAID5-level redundancy for full stripe
writes and RAID1-level redundancy for temporarily storing data from partial stripe
updates. Since partial stripe writes use the RAID1 scheme, we avoid the synchronization
that the RAID5 scheme requires for this access pattern, which addresses the
synchronization problem in the distributed RAID5 environment.



2 SARCNFS Clustered NAS File System

2.1 System Architecture Overview 

SARCNFS is designed as a server-less system [5], [6], in which multiple NAS nodes
store file data and manage metadata. Each SARCNFS file is striped across a set of
nodes in order to facilitate parallel access. This set of nodes is selected at random,
and the data is distributed over the set of nodes with a round-robin policy. The
layout of SARCNFS is shown in Fig. 1.




Fig. 1. Physical layout of SARCNFS in Clustered NAS Environment 

The specifics of a given file distribution are described with three metadata
parameters: base node number, number of nodes, and stripe size. These parameters,
together with the ordering of the nodes for the file system, allow the file distribution
to be completely located. To access SARCNFS file data, the client first obtains the
metadata of the SARCNFS file from the NAS nodes, and then sends requests directly to
the NAS nodes storing the relevant portions of the file.
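A small sketch shows how the three metadata parameters can locate any byte of a file under
round-robin striping. It assumes that "stripe size" means the number of bytes placed on
one node before moving to the next, and that nodes are numbered consecutively modulo the
cluster size; the function name `locate` and the `total_nodes` argument are introduced for
the example and are not part of SARCNFS.

```python
def locate(offset, base_node, num_nodes, stripe_size, total_nodes):
    """Map a file byte offset to (node number, offset in that node's data file)."""
    unit = offset // stripe_size                       # index of the stripe unit
    node = (base_node + unit % num_nodes) % total_nodes
    # Each full round over the node set adds one stripe unit to every data file.
    local = (unit // num_nodes) * stripe_size + offset % stripe_size
    return node, local


if __name__ == "__main__":
    # A file striped over 3 nodes starting at node 2, 64-byte units, 8-node cluster.
    for off in (0, 64, 130, 192):
        print(off, locate(off, base_node=2, num_nodes=3, stripe_size=64, total_nodes=8))
```

With this mapping a client can compute the target node for every part of a request
locally, which is what allows it to contact the storing nodes directly.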



2.2 Self-Adaptive Redundancy Scheme 

In the RAID1 scheme implementation in SARCNFS, each NAS node stores two files per
client file. One file is the data file used to store the data, just as in PVFS.
The other file is the redundancy data file used to store redundancy. The contents of a
redundancy block are identical to the contents of the corresponding data block. As a
result, the RAID1 scheme in SARCNFS can utilize all the available bandwidth on a read
operation to provide parallel read performance. On a write, every node must be
written twice.

The distributed RAID5 scheme in SARCNFS also has a redundancy file on each node in
addition to the data file, like the RAID1 scheme. In the distributed RAID5 scheme,
however, these files contain the parity data for specific portions of the data files.
On a write operation, the client checks the offset and size to judge whether any
stripes are about to be updated partially. There can be at most two partially updated
stripes in a given write operation. The client reads the data in the partial stripes and
the corresponding parity region, computes the parity for the partial and full stripes,
and writes out the new data and new parity.
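The parity computation behind these two cases can be sketched with the usual XOR parity of
RAID5. The helper names are hypothetical, and the read-modify-write form used for the
partial case (XOR of old parity, old data and new data) is the textbook RAID5 update rather
than a detail confirmed by the paper.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def full_stripe_parity(data_blocks):
    """Full stripe write: parity is computed from the new data alone."""
    return xor_blocks(*data_blocks)


def updated_parity(old_parity, old_data, new_data):
    """Partial stripe write: old data and old parity must be read first."""
    return xor_blocks(old_parity, old_data, new_data)


if __name__ == "__main__":
    stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
    parity = full_stripe_parity(stripe)
    new_block = b"\xaa\x55"
    print(updated_parity(parity, stripe[1], new_block).hex())
```

The extra reads in `updated_parity` are the source of the small-write latency noted in the
introduction.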

In the Adaptive Redundancy scheme, the level of redundancy is selected according to
the following rule. Every client write access is broken down into three portions: (1)
a partial stripe write at the start, (2) a portion that updates an integral number of
full stripes, and (3) a trailing partial stripe write. Depending on data alignment and
size, any of the portions can be empty. For the portion of the write that updates full
stripes, we compute and write the parity, just as in the RAID5 case. For the portions
involving partial stripe writes, we write the data and redundancy as in the RAID1 case,
except that the updated blocks are written to an overflow region on the nodes. The
blocks cannot be updated in place because the old blocks are needed to reconstruct the
data in the stripe in the event of a crash. When a file is read, the nodes return the
latest copy of the data, which could be in the overflow region.
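The three-way split of a write can be sketched as below. The function name `split_write`
and the half-open `(start, end)` byte ranges are assumptions made for the example;
`stripe_bytes` stands for the amount of file data covered by one full stripe.

```python
def split_write(offset, size, stripe_bytes):
    """Split a write into (leading partial, full stripes, trailing partial).

    Each portion is a half-open byte range (start, end), or None if it is empty.
    """
    end = offset + size
    first_full = -(-offset // stripe_bytes) * stripe_bytes   # offset rounded up
    last_full = (end // stripe_bytes) * stripe_bytes         # end rounded down
    if first_full > last_full:
        # The whole write lies strictly inside one stripe: a single partial write.
        return (offset, end), None, None
    head = (offset, first_full) if offset < first_full else None
    full = (first_full, last_full) if first_full < last_full else None
    tail = (last_full, end) if last_full < end else None
    return head, full, tail


if __name__ == "__main__":
    # 64-byte stripes: bytes 10..209 touch two partial stripes and two full ones.
    print(split_write(10, 200, 64))
```

Under the adaptive rule, the `full` range would be written with parity as in RAID5, while
`head` and `tail` would be mirrored into the overflow region as in RAID1.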



2.3 Distributed Locking Mechanism 

We implemented a simple distributed locking mechanism to ensure that two clients
writing concurrently to disjoint portions of the same stripe do not leave the parity for
the stripe in an inconsistent state in the distributed RAID5 scheme. When a node
receives a read request for a parity block, it knows that a partial stripe update is
taking place. If there are no outstanding writes for the stripe, the node sets a lock to
indicate that a partial stripe update is in progress for that stripe. It then returns the
data requested by the read. Subsequent read requests for the same parity block are put
on a queue associated with the lock. When the node receives a write request for a parity
block, it writes the data to the parity file and then checks whether any blocked read
requests are waiting on the block. If there are no blocked requests, it deletes the
lock; otherwise it wakes up the first blocked request on the queue. The client checks
the offset and size of a write to determine the number of partial stripe writes to be
performed. If two partial stripes are involved, the client serializes the reads for the
parity blocks, waiting for the read for the first stripe to complete before issuing the
read for the last stripe. This ordering of reads avoids deadlocks in the locking protocol.
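The node-side behavior of this protocol can be modeled with a small single-threaded state
machine. The class `ParityNode` and its method names are hypothetical, and a real node
would block and wake network requests rather than return client names; the sketch only
shows how a parity read takes the lock, how later reads queue up, and how a parity write
hands the lock over or releases it.

```python
from collections import deque


class ParityNode:
    """Toy model of the per-stripe parity lock kept by a storage node."""

    def __init__(self):
        self.parity = {}   # stripe number -> current parity bytes
        self.locks = {}    # stripe number -> queue of clients with blocked reads

    def read_parity(self, stripe, client):
        """Return the parity if the lock is acquired, or None if the read is queued."""
        if stripe in self.locks:
            self.locks[stripe].append(client)   # an update is already in progress
            return None
        self.locks[stripe] = deque()            # set the lock for this stripe
        return self.parity.get(stripe, b"")

    def write_parity(self, stripe, data):
        """Store the new parity; return the client whose blocked read is woken."""
        self.parity[stripe] = data
        waiting = self.locks.get(stripe)
        if not waiting:
            self.locks.pop(stripe, None)        # nobody waiting: delete the lock
            return None
        return waiting.popleft()                # next reader now holds the lock


if __name__ == "__main__":
    node = ParityNode()
    print(node.read_parity(7, "A"), node.read_parity(7, "B"), node.write_parity(7, b"p1"))
```

In this model the woken client would then be served the freshly written parity, so every
partial stripe update is computed from a consistent parity block.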

The RAID1 scheme does not require any additional synchronization. The same is
true of the Adaptive Redundancy scheme, since it uses mirroring for partial stripe
writes.

3 Conclusion

SARCNFS uses a self-adaptive redundancy scheme for file data access that uses a
combination of RAID5 and RAID1 writes to store data. Full stripe writes use the
RAID5 scheme. RAID1 is used to temporarily store data from partial stripe updates.
As part of the SARCNFS implementation, we have proposed a simple locking mechanism
that addresses the consistency problem of distributed implementations of RAID5
redundancy. Since partial stripe writes use the RAID1 scheme, we avoid the
synchronization necessary in the RAID5 scheme for this access pattern. In addition,
unlike many other clustered file systems, our SARCNFS file system does not depend on
client modifications, since it is fully compatible with standard distributed file
systems such as NFS and CIFS. As a result, SARCNFS provides high performance, low
latency and strong reliability.

Abstract. Human error and incorrect software (a.k.a. soft-failure) are key
impediments to the dependability of Internet services. To address this challenge,
storage providers need rapid recovery techniques that retrieve data
from a time-based recovery point. Motivated by this need, we introduce a snapshot
facility at the block level called SnapChain. Compared with former
implementations, SnapChain minimizes the disk space requirement and the write
penalty of the master volume when managing different versions of snapshots.
This paper explains the metadata and the algorithms used in SnapChain.



1 Introduction 

The Berkeley/Stanford ROC (Recovery-Oriented Computing) Project claims that
human error and incorrect software (a.k.a. soft-failure), rather than device failure
(a.k.a. hard-failure), are the largest causes of failures in Internet services [1]. To
protect against soft-failure, it is necessary to create point-in-time copies of data
periodically and to maintain different versions of those copies. Amazon.com, for
example, is reported to create a point-in-time copy of its data as frequently as three
times per hour [1].

Unlike the other two classes of point-in-time copies, split mirror and
concurrent, a snapshot requires much less storage and needs no advance setup
prior to executing a point-in-time copy. Therefore, it is more suitable for soft-failure
recovery. For a snapshot facility, the expectation of continuous operation is
commonplace. However, the ability to create and maintain different versions of
point-in-time copies of the data efficiently, with minimal interruption and minimal
overhead, is critical. This paper focuses only on block-level snapshot facilities.

Linux LVM and EVMS [2] are volume managers that support snapshots. In them,
the snapshot of a master volume, namely a shadow volume, is achieved through a pseudo
volume that contains pointers to two separate physical regions. One region is simply
the unchanged blocks in the master volume. The other region, named the private region,
collects the original state of the master volume blocks just before they are updated, as
well as any new changes made by the point-in-time consumer to the snapshot. Sun
StorEdge [3] treats a master volume and its related shadow volume as a volume pair. A
bitmap is used to keep track of the differences within a volume pair that occur after
the established point-in-time. In the above snapshot implementations, the shadow volumes
of the same master volume have no relationship with each other. If they can share the
old data, just as the master volume and a shadow volume share unmodified contents, the
amount of data that is copied from the master volume and kept in the private region of
a shadow volume can be reduced. Based on such optimizations, SnapChain, the
snapshot facility presented in this paper, can minimize the effort, time and
incremental capacity that are necessary to obtain point-in-time views. The effects of
the optimizations are especially remarkable in an environment supporting soft-failure
recovery, where a master volume has a number of shadow volumes.

The remainder of the paper is organized as follows. The next section describes the
metadata and algorithms used in SnapChain. After SnapChain is evaluated in Section 3,
the conclusion is drawn in Section 4.



2 SnapChain 

SnapChain is implemented in Linux and designed to support up to 255 shadow
volumes for a master volume. It uses a pool-and-pointer design, in which metadata
keeps the location and state of each data chunk (a group of blocks) of the
shadow volumes. Shadow volumes of the same master volume share a large storage
pool, named the snapshot pool, to keep their private data. Every shadow volume owns a
logical private region in the snapshot pool. The snapshot pool is composed of one or
multiple block devices in linear mode, and can be expanded on demand by absorbing new
devices. Registering a volume in SnapChain makes it a master volume. After
registration, the content of the master volume is kept untouched. Losing
metadata or uninstalling SnapChain will not render the data of the master volume
unusable.



[Figure labels: user space: configure tools; kernel space: map manager (chunk search table, chunk map table), free space manager (free space bitmap), copy manager; snapshot pool]



Fig. 1. Architecture of SnapChain 



As Fig. 1 indicates, SnapChain has a mid-level bloek device driver and user space 
tools. The device driver can be inserted arbitrarily in a system’s layered block I/O 
hierarchy. It consists of three components: map manager, free space manager, and 
copy manager. Map manager uses CMT (Chunk map table) and CST (chunk search 
table) to translate the virtual address of shadow volume into the corresponding 
physical address of master volume or snapshot pool. Free space manager uses FSB 
(free space bitmap) to allocate and reclaim disk space from snapshot pool for shadow 
volumes to store their private data. To maintain consistency of shadow volumes, copy 
manager controls copying data from master volume to snapshot pool. The user space 
tools used to configure the device driver follow a set of operations.

2.1 Metadata

CMT is used to mark the current states of chunks in shadow volume and master
volume. Each volume owns a CMT, and every chunk has a flag in CMT. For a chunk
of master volume, the flag has two states: 0 means the chunk has not been updated
since the newest shadow volume was created; 1 means the chunk has been updated.
For a chunk of shadow volume, the flag has three states: 0 means the chunk is not
stored in the private region of the shadow volume; 1 means the chunk is stored in the
private region of the shadow volume and is the original state of master volume; 2
means the chunk is stored in the private region of the shadow volume and is updated
by the point-in-time consumer. CST is used to map the virtual chunk of a shadow
volume into the physical chunk in snapshot pool. Each shadow volume has a CST. In
order to search rapidly, CST is organized into a hash table. The metadata of shadow
volumes of the same master volume are chained by a bi-directional list, namely the
shadow volume list, in time order. The master volume keeps the head of the list.
Extent (a group of chunks) is the unit of disk space allocation. In FSB, every extent
of snapshot pool has one bit to record its allocation state.

Fig. 2. Data layout in snapshot pool

It is not necessary to explicitly store CMT, CST and FSB on disk. Instead, enough
information on snapshot pool can be maintained to allow them to be reconstructed.
Fig. 2 shows the data layout in a snapshot pool. The first 16KB is SPDA (Snapshot
Pool Descriptor Area), including snapshot pool info, master volume info, and shadow
volume info. On each device of snapshot pool, SPDA is allocated for backup reasons.
The remainder of snapshot pool is continuous extents. The extents are divided into
several classes: free extent class and extent classes of each shadow volume. The PL
method [4] is used to manage the extents. The first 4 bytes of each extent are used as a
pointer to indicate the next extent of the same class. Thus the extents of the same class
become a list. Shadow volume info keeps the head of the shadow volume extent list.
In an extent, the first chunk, except the first 4 bytes, is used to record the serial
numbers of the following chunks. The nth entry is the serial number of the virtual
chunk to which the (n+1)th chunk in the extent corresponds. The size of chunk (Csize)
is chosen between 1KB and 8KB, and the size of extent is (Csize^2/4) MB. Large chunk
size can reduce the footprint of the metadata and increase the addressing range of
CST. However, it may result in significant performance overhead.
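The extent size formula can be checked with a short derivation. It assumes (the text does not state this) that each serial number in the first chunk occupies 4 bytes, like the next-extent pointer, and that Csize is expressed in KB.

```latex
\begin{align*}
\text{4-byte slots in the first chunk}
  &= \frac{1024\,C_{size}\ \text{bytes}}{4\ \text{bytes}} = 256\,C_{size}
  \quad\text{(1 pointer} + (256\,C_{size}-1)\ \text{serial numbers)},\\
\text{chunks per extent}
  &= 1 + (256\,C_{size}-1) = 256\,C_{size},\\
\text{extent size}
  &= 256\,C_{size}\times C_{size}\ \text{KB}
   = 256\,C_{size}^{2}\ \text{KB}
   = \frac{C_{size}^{2}}{4}\ \text{MB}.
\end{align*}
```

Under these assumptions a 4KB chunk size gives a 4 MB extent of 1024 chunks, and the allowed range of 1KB to 8KB gives extents from 0.25 MB to 16 MB.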
2.2 Commands and Algorithms

SnapChain information can be created, displayed, and manipulated by the user space
tools. The command “spcreate test mv /dev/sda1 /dev/sda2” registers the master
volume /dev/sda1 and creates its snapshot pool on /dev/sda2. The command spremove
removes from the system all knowledge of the specified snapshot pool and releases
the master volume from the device driver. The similar commands svcreate and svdelete
can be used to create and delete shadow volumes. While svcreate is executing,
SnapChain stalls all incoming I/O requests for the time required to flush all
outstanding writes to the master volume. When everything is synchronized on stable
storage, SnapChain appends the metadata of the new shadow volume to the shadow
volume list, and creates a new virtual block device. The command svrestore restores
the master volume from the specified shadow volume. SnapChain also provides to user
programs an ioctl command interface and a /proc interface for device information and
statistics.

A read request to the master volume is served directly from the master volume. When
writing the master volume, COFW (copy on first write) [3] is executed. Unlike former
implementations, only the original data of a chunk whose CMT flag is 0 is copied
to the private region of the newest shadow volume. To find the physical chunk that
holds the requested block of a specified shadow volume, SnapChain first consults the
CMTs in the shadow volume list to locate the volume whose private region stores the
specified chunk. It takes the following steps:

1. If the CMT chunk flag of the target shadow volume is not 0, the target
shadow volume is the volume that we want to find.

2. Otherwise, check the next newer shadow volume. If its CMT chunk flag is 1, it is
the volume that we want to find. If not, repeat until the volume is found. If no such
shadow volume exists, the master volume is the volume that we want to find.

Then SnapChain uses the CST of the found volume to get the physical address. When
writing a shadow volume, if the CMT flag of the target chunk is 1, the data of the
target chunk should first be moved to the older shadow volume. Deleting a shadow
volume involves a similar procedure. The command svrestore uses the algorithm for
reading a shadow volume to get the contents of the specified shadow volume, and uses
the algorithm for writing the master volume to update the master volume.
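The two lookup steps for reading a shadow volume can be sketched as follows. This is a minimal illustration, not SnapChain's driver code: the ShadowVolume struct, the locate_chunk function and the oldest-to-newest vector standing in for the shadow volume list are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// CMT flag values for a shadow-volume chunk, as described in Sect. 2.1.
enum ChunkFlag : std::uint8_t { kNotStored = 0, kOriginal = 1, kUpdated = 2 };

// Hypothetical in-memory stand-in for one shadow volume's CMT.
struct ShadowVolume {
    std::vector<std::uint8_t> cmt;  // one flag per chunk
};

// Returns the index of the shadow volume whose private region holds `chunk`
// as seen from shadow volume `target`, or std::nullopt if the data must be
// read from the master volume. `chain` is ordered from oldest to newest.
std::optional<std::size_t> locate_chunk(const std::vector<ShadowVolume>& chain,
                                        std::size_t target, std::size_t chunk) {
    assert(target < chain.size());
    // Step 1: the target volume stores the chunk itself (flag 1 or 2).
    if (chain[target].cmt[chunk] != kNotStored) return target;
    // Step 2: walk towards newer volumes; only an unmodified copy (flag 1) counts.
    for (std::size_t v = target + 1; v < chain.size(); ++v)
        if (chain[v].cmt[chunk] == kOriginal) return v;
    return std::nullopt;  // never copied: the master volume still holds it
}
```

A real implementation would then consult the CST of the returned volume to obtain the physical chunk address; that mapping is omitted here.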



3 Evaluation 

The goal of SnapChain is to minimize the write penalty and the capacity necessary to
obtain point-in-time views. The response time and consumed capacity of SnapChain
are compared with those of Linux LVM. As expected, SnapChain achieves better write
performance and consumes less capacity.



Table 1. Consumed capacity and response time when writing to master volume.

Snapshot facility          SnapChain         Linux LVM
Shadow volume number       4        8        4        8
Consumed Capacity (MB)     12       12       48       96
Response Time (ms)         2335     2338     5481     9767

Table 2. Response time when reading from shadow volume.

Snapshot facility          SnapChain         Linux LVM
Shadow volume number       4        8        4        8
Response Time (ms)         180      179      178      180



The tests are run on a dual Pentium 700 MHz computer with 256 MB of RAM.
An NEC S2100 disk array (RAID 5) is connected directly to an Emulex Lightplus 750
HBA card in the PC. The versions of Linux kernel and LVM are 2.4.20 and 2.00.07
respectively. The chunk size used by LVM and SnapChain is set to 4KB. To avoid the
influence of the Linux buffer cache, “raw” devices associated with the block devices
are used. In the first test, the master volume has 4 shadow volumes. In the second
test, it has 8 shadow volumes. Table 1 shows consumed capacity and response time
when copying 10MB of data to an area of the master volume that has not been updated
since the first snapshot was created. The capacity consumed and the response time of
SnapChain are much lower than those of LVM, because only 10MB of data is copied to
the snapshot pool in SnapChain rather than n*10MB in LVM, where n is the number of
shadow volumes. As the number of shadow volumes increases, the advantage of
SnapChain over LVM grows. Table 2 shows response time when reading 10MB of data
from a shadow volume. The response times of SnapChain are similar to those of LVM.
Thus the optimizations do no harm to the performance of accessing a shadow volume.
Since restoring the master volume from a specified shadow volume involves reading
from the shadow volume and writing to the master volume, the recovery time of
SnapChain is better than that of LVM.



4 Conclusion 

In this paper, we propose SnapChain, a block level snapshot facility. SnapChain
enables a master volume and its snapshots to share as much content as possible.
Consequently it can support more snapshots, and achieve better performance when
accessing the master volume, than its peers do. SnapChain supports write requests to
snapshots. Besides, it can create a snapshot transparently to applications without
disturbing current data access, and recover a master volume from a specified
point-in-time copy in a short time. Therefore SnapChain is suitable for soft-failure
recovery.
GOOMPI: A Generic Object Oriented Message Passing Interface

Abstract. This paper discusses the application of object-oriented and
generic programming techniques in high performance parallel computing,
and presents GOOMPI, a new message-passing interface based on these
techniques. It describes the design and implementation of GOOMPI, shows
through typical examples its value in designing and implementing parallel
algorithms and applications based on the message-passing model, and
analyzes the performance of our GOOMPI implementation.



1 Introduction 

One of the most important distinctions between parallel computing and normal
sequential computing is the complexity and diversity of parallel computing
models. From the programming point of view, there are three main parallel
computing models: the data-parallel model, the shared-variable model and the
message-passing model [1].

The data-parallel and shared-variable models are concise and intuitive in
programming. However, they often require tightly coupled parallel computers based
on shared memory, e.g. parallel vector machines or SMPs, and it is difficult to
implement them directly and efficiently on more popular architectures based on
distributed memory, such as MPPs, COWs and SMP clusters.

The message-passing model is more intuitive and easier to implement on parallel
computing environments based on distributed memory.

Typical message-passing libraries include the Parallel Virtual Machine (PVM)
[2] and the Message Passing Interface (MPI) [3]. Both can be easily implemented
in homogeneous or heterogeneous distributed environments, and have therefore
become prevalent.

Parallel programs using message-passing libraries can often achieve better
performance than programs using other models, through elaborate handcrafted
optimization of message sending and receiving between parallel processes.
However, coding a parallel program based on message-passing is more difficult
than coding one based on the data-parallel or shared-variable models. Programmers
often spend much time on the part of a program concerned with messages.
When handling only messages of primitive data types or their arrays, the effort
is still acceptable. However, the effort becomes far more considerable and the
complexity of the message-passing part of the program increases significantly
when passing complex dynamic data structures such as binary trees, graphs,
or large sparse matrices stored in orthogonal lists. It then becomes difficult to
guarantee the correctness and efficiency of the program.

Worst of all, the complexity of the message-passing part always obscures
the structure of the algorithm and the program itself. These programs tend to
fall into implementation details and lose their abstraction and genericity. This
also makes parallel programs much harder to read, maintain and extend.

In short, it is difficult to map a parallel algorithm into a message-passing
based parallel program rapidly and intuitively. This is the problem that the
paper tries to solve.

In the last decade, object-oriented (OO) techniques have achieved great success
in program and software system construction. As a complementary paradigm to OO,
generic programming techniques aided with inlining and template metaprogramming
gain genericity and extensibility through compile time parametric polymorphism.
The C++ Standard Template Library (STL) [5] is a milestone of generic
programming techniques.

Much work has been done on applying OO and generic programming
techniques to HPC areas, such as POOMA [6], Janus [7] and HPC++ [8].
The GOOMPI presented in this paper also takes advantage of those techniques,
trying to provide programmers with a unified generic message-passing interface,
effectively simplifying the development process of parallel programs and
dramatically improving their abstraction, genericity and flexibility.

The remaining sections of this paper are organized as follows: in Sect. 2, a full
discussion of GOOMPI is presented with some design policies and implementation
details; in Sect. 3 we outline a well known matrix multiplication example using
GOOMPI; the performance evaluation of GOOMPI compared to normal MPI is given
in Sect. 4; in the last two sections, we discuss related work and provide our
conclusions on what we have experienced.



2 The Design and Implementation of GOOMPI 

2.1 The Layered Structure of GOOMPI 

As an attempt to apply OO and generic programming techniques to high
performance parallel computing, we developed a generic object-oriented
message-passing library, GOOMPI. It adopts MPI, a library currently widely used
in high performance parallel computing, as its underlying implementation
basis. By using OO and generic programming techniques, it constructs a generic
high performance message-passing framework to effectively support the transfer
of user-defined dynamic data structures of arbitrary complexity.

GOOMPI provides a complete framework for message-passing. It includes a
set of well-designed interfaces and class hierarchies, and consists of two layers:
the serialization layer and the message-passing layer.

The layered communication architecture of GOOMPI is depicted in Fig. 1.
Further explanation is presented in the following sections.




Fig. 1. Layered Communication Architecture of GOOMPI



2.2 Message-Passing Layer 

The first layer of GOOMPI, the message-passing layer, uses the iostream
library to abstract the underlying MPI communication functions and isolate their
implementation details.

The iostream library is an important part of the C++ standard library. Its
architecture is efficient and well suited to extension.

The iostream library also separates I/O operations into two functional
layers: the formatting layer and the transferring layer. In order to effectively
perform message-passing using MPI under the architecture of iostream,
we write a new transferring layer by deriving a basic_mpi_buf class from
std::basic_streambuf. It supports messages of both the XDR format [10] (for
heterogeneous environments) and native format (for homogeneous environments).
It follows the buffered iostream manner, and also provides an optional ability to
communicate without extra copies to or from buffers, which makes efficient
transfer of large blocks of consecutive data possible. Besides improved
performance, this also results in better scalability. Unlike some implementations
of MPI, our library supports transferring messages of arbitrary size owing to the
extensible architecture of the iostream library, and this requirement is common
in scalable high performance scientific computing.
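To illustrate how a custom transferring layer plugs into iostream, here is a minimal sketch of a stream buffer derived from std::streambuf. It is not GOOMPI's basic_mpi_buf: the class name chunked_buf and the Transport callback (a stand-in for an MPI send) are assumptions made for the example, and only the output direction is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <ostream>
#include <streambuf>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an MPI send: receives each flushed chunk.
using Transport = std::function<void(const char*, std::size_t)>;

// Output-only transferring layer: buffers characters and hands full buffers
// to the transport, so a message of any total size can be streamed in pieces.
class chunked_buf : public std::streambuf {
public:
    chunked_buf(Transport t, std::size_t capacity)
        : transport_(std::move(t)), buffer_(capacity) {
        assert(capacity > 0);
        setp(buffer_.data(), buffer_.data() + buffer_.size());
    }

protected:
    // Called when the put area is full: ship it, then store the new character.
    int_type overflow(int_type ch) override {
        flush_chunk();
        if (!traits_type::eq_int_type(ch, traits_type::eof())) {
            *pptr() = traits_type::to_char_type(ch);
            pbump(1);
        }
        return traits_type::not_eof(ch);
    }

    // Called on std::flush: ship whatever is still pending.
    int sync() override {
        flush_chunk();
        return 0;
    }

private:
    void flush_chunk() {
        const std::size_t n = static_cast<std::size_t>(pptr() - pbase());
        if (n > 0) transport_(pbase(), n);
        setp(buffer_.data(), buffer_.data() + buffer_.size());
    }

    Transport transport_;
    std::vector<char> buffer_;
};
```

Wrapping such a buffer in a std::ostream gives the formatting layer for free: every operator<< already defined for standard streams works unchanged, which is the property the section relies on.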

Besides standard-mode send and receive, MPI provides many other communication
modes such as buffered mode, synchronous mode and ready mode, as
well as nonblocking communications and a variety of collective communications
(broadcast, scatter, gather, reduce, all-to-all, etc.). We adopt a policy-based
design strategy [11] to support these variations without code duplication or
loss of efficiency. Actual communication operations are encapsulated uniformly
in operation policies, which are passed as a template parameter of basic_mpi_buf.

GOOMPI provides several pre-defined communication operation policies:
send_recv, isend_recv, bcast, scatter, gather, reduce, all_to_all, etc. Each
policy has two member functions:

void send(const void* addresses[], std::size_t sizes[], int nplayers = 2);
void recv(void* addresses[], std::size_t sizes[], int nplayers = 2);

For example, policy send_recv implements these two member functions using
MPI_Send and MPI_Recv, while bcast uses MPI_Bcast.
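The policy-based idea can be sketched without MPI. In the example below the policies only log the name of the MPI call they would make; message_buf, send_recv_policy, bcast_policy and call_log are hypothetical names, and the single-buffer send signature is a simplification of the array-based signatures quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Mock log standing in for real MPI calls (assumption: no MPI is linked here).
inline std::vector<std::string>& call_log() {
    static std::vector<std::string> log;
    return log;
}

// Policy mirroring send_recv: a real version would call MPI_Send.
struct send_recv_policy {
    static void send(const void* /*data*/, std::size_t size) {
        call_log().push_back("MPI_Send(" + std::to_string(size) + ")");
    }
};

// Policy mirroring bcast: a real version would call MPI_Bcast.
struct bcast_policy {
    static void send(const void* /*data*/, std::size_t size) {
        call_log().push_back("MPI_Bcast(" + std::to_string(size) + ")");
    }
};

// The buffer takes the operation as a template parameter, so the call is
// resolved at compile time: no virtual dispatch and no duplicated buffer code.
template <typename OperationPolicy = send_recv_policy>
class message_buf {
public:
    void append(const std::string& bytes) { pending_ += bytes; }

    void flush() {
        OperationPolicy::send(pending_.data(), pending_.size());
        pending_.clear();
    }

private:
    std::string pending_;
};
```

Switching from point-to-point to broadcast is then a change of one template argument, for instance message_buf<bcast_policy> instead of message_buf<>.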

Two classes, mpi_ostream and mpi_istream, derived from std::basic_ostream
and std::basic_istream respectively, wrap the basic_mpi_buf class
to provide a convenient stream interface.

2.3 Serialization Layer 

In order to support message-passing based on arbitrary data types, MPI does
provide a mechanism to facilitate user-defined types by using data structures
like MPI_Datatype and functions such as MPI_Address, MPI_Type_struct and
MPI_Type_commit. However, it is tedious and cumbersome to use. Worse, it is
still limited to simple Plain Old Data (POD) types such as C-style structs.

We therefore introduce serialization and deserialization of arbitrary objects.
That is, when sending an object of any type, the message content includes the
object memory layout, which is converted into a stream of a specific format; when
receiving, the information extracted from the message forms a stream of that
format, from which a copy of the original object can be easily reconstructed.

We choose the Boost Serialization Library (BSL) [9] as the implementation
basis of GOOMPI’s serialization part. BSL supports noninvasive serialization.
It can easily serialize primitive types as well as STL containers. BSL exploits
a layered design approach, taking a stream as a parameter of its archive class
to specify the actual storage and transmission of serialized objects. The archive
itself is concerned only with serialization-related issues and does not care
how the serialized objects are stored or transferred. This design makes it
convenient for us to combine BSL with MPI’s message-passing functionality
through our MPI streams.

GOOMPI customizes two high performance archive classes, mpi_oarchive
and mpi_iarchive, to serialize and deserialize objects of POD types as well as
non-POD types (for example, types that have a nontrivial constructor or
destructor) and transfer them through MPI streams. In homogeneous environments,
extra optimizations are provided for commonly used POD types such as
C-style structs, arrays of POD types, and even std::vectors with POD type
elements. In particular, for large arrays and std::vectors of POD types,
unnecessary copy operations to or from buffers are avoided. These optimizations
result in high performance and little abstraction overhead in GOOMPI programs.
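Noninvasive serialization over a stream can be sketched with two toy archives. They imitate the style of a free serialize function and an operator& on the archive, but they are not the Boost Serialization Library or GOOMPI's mpi_oarchive and mpi_iarchive: oarchive, iarchive, Point and the whitespace-separated text format are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <type_traits>
#include <vector>

// Toy output archive: writes values as whitespace-separated text to a stream.
class oarchive {
public:
    explicit oarchive(std::ostream& os) : os_(os) {}
    template <typename T> oarchive& operator&(T& value) { save(value); return *this; }

private:
    template <typename T> void save(T& value) {
        if constexpr (std::is_arithmetic_v<T>) os_ << value << ' ';
        else serialize(*this, value);  // free function found by ADL
    }
    template <typename T> void save(std::vector<T>& v) {
        os_ << v.size() << ' ';
        for (auto& e : v) *this & e;
    }
    std::ostream& os_;
};

// Toy input archive: reads the same format back.
class iarchive {
public:
    explicit iarchive(std::istream& is) : is_(is) {}
    template <typename T> iarchive& operator&(T& value) { load(value); return *this; }

private:
    template <typename T> void load(T& value) {
        if constexpr (std::is_arithmetic_v<T>) is_ >> value;
        else serialize(*this, value);
    }
    template <typename T> void load(std::vector<T>& v) {
        std::size_t n = 0;
        is_ >> n;
        v.resize(n);
        for (auto& e : v) *this & e;
    }
    std::istream& is_;
};

// A user type that knows nothing about archives (no base class, no members added).
struct Point { int x; double y; };

// Noninvasive support: one free function serves both saving and loading.
template <typename Archive>
void serialize(Archive& ar, Point& p) { ar & p.x & p.y; }
```

Because the archives only see a std::ostream or std::istream, the same code would work over a file, a string stream, or a stream backed by a message-passing buffer.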

2.4 Stream Style User Interface 

Finally, GOOMPI provides a port class as a facade [12] incorporating all the
classes above, and exposes a concise iostream style interface for message-passing.

We borrow from OOMPI the notion of port. A port represents a communication
channel between parallel processes. Any C++ objects with appropriate
serialization support or data of primitive types can be transferred as messages
through the channel.

Users of GOOMPI need not know the internals of a port. However, one
may want to customize some of the behaviors of a port before using it.

There are three aspects of a port’s communication behavior:

1. The communication operation of the port can be blocking or nonblocking,
point-to-point or collective. It is selected by choosing an appropriate
communication policy of basic_mpi_buf.

2. The port can be input-only, output-only, or support both input
and output.

3. The port can transfer any type of data, only messages of specific data
types, or messages in a specific sequence.

Strictly speaking, the last two aspects are restrictions on the behavior of a port.
Restrictions do not always mean limitations. In fact, imposing restrictions on
the port class helps detect logical errors of a program at compile time or
runtime. In specific cases, useless functions can be removed from a port
by the programmer from the beginning to eliminate the possibility of making
mistakes. For example, performing an input operation on an output-only port
causes a compile time error immediately.

Programmers can also choose to apply no restriction on their port classes for
more functionality or just for convenience. In the latter case, they also lose the
opportunity for the compiler to help detect program errors earlier at compile
time.

We use the policy-based method again to design the port class. There are three
policies correspondingly:

Operation Policy is the same as the operation policy of MPI streams and
basic_mpi_buf. It specifies blocking or nonblocking, point-to-point or
collective communication operations. The default is standard mode point-to-point
operation.

Direction Policy specifies the communication direction of the port to be in, out
or inout. The default is inout. Attempts to communicate in an invalid direction
are detected at compile time.

Message Type Policy specifies the data types allowed to be transferred by the
port. The allowed types can be specified conveniently using a mechanism
called Type Patterns, which was developed by the authors of this paper. For
example, pattern MyClass means the port can only transfer messages of a type
named MyClass; pattern seq_type(A, B, C) guarantees the port transfers
messages of type A, B and C in sequence; while set_type(A, derived(B))
means type A or any type derived from B. Leaving it blank means any type is
allowed.

The formal definition of port is as follows:

CommChannel      ::= port<OperationPolicy, DirectionPolicy, MsgTypePolicy>
OperationPolicy  ::= P2PComm | CollectiveComm
P2PComm          ::= send_recv | isend_recv | ...
CollectiveComm   ::= bcast | scatter | gather | all_to_all | ...
DirectionPolicy  ::= in | out | inout
MsgTypePolicy    ::= TypePattern
TypePattern      ::= TypeName | any_type | set_type(TypeList)
                   | seq_type(TypeList) | derived(TypeName) | ...

For example, programmers can define an output port which broadcasts
messages of type Matrix like this:

goompi::port<bcast, out, Matrix> p(...);

Most of the time, if one simply wants to use ordinary point-to-point
communication, a common port can be defined as follows:

goompi::port<> p(...);

Such a port uses all its default policies.
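How a direction policy can turn a misuse into a compile time error is sketched below. This is not GOOMPI's port class: the sketch namespace, the tag types and the std::stringstream used as the channel are assumptions made for the example, and only the direction policy is modelled.

```cpp
#include <cassert>
#include <sstream>
#include <type_traits>

namespace sketch {

// Direction tags, named after the policies in the text.
struct in {};
struct out {};
struct inout {};

// Toy port over a string stream; only the direction policy is modelled.
template <typename Direction = inout>
class port {
public:
    static constexpr bool can_send = !std::is_same_v<Direction, in>;
    static constexpr bool can_receive = !std::is_same_v<Direction, out>;

    explicit port(std::stringstream& channel) : channel_(channel) {}

    template <typename T>
    port& operator<<(const T& value) {
        // Checked only when this operator is actually used on the port.
        static_assert(!std::is_same_v<Direction, in>,
                      "cannot send on an input-only port");
        channel_ << value << ' ';
        return *this;
    }

    template <typename T>
    port& operator>>(T& value) {
        static_assert(!std::is_same_v<Direction, out>,
                      "cannot receive on an output-only port");
        channel_ >> value;
        return *this;
    }

private:
    std::stringstream& channel_;
};

}  // namespace sketch
```

With these definitions, writing `rx << 1` on a `sketch::port<sketch::in>` or `tx >> x` on a `sketch::port<sketch::out>` is rejected by the compiler, while an unrestricted `sketch::port<>` accepts both operations.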



2.5 Other Features of GOOMPI 

Besides communication, GOOMPI also provides other useful generic components
to facilitate parallel programming:

Virtual Topology of Parallel Tasks. GOOMPI presents several components
to support virtual topologies of parallel tasks. For example, class templates
mesh_view, tree_view and graph_view represent Cartesian topologies of
any finite dimension, ordered trees and generic graph structures respectively.
They all provide convenient functions to specify neighbors (such as the left,
right, up and down neighbors, etc.) of each task or particular task groups (for
instance a particular row, column or block of tasks). Users can also extend
existing topologies or define new topologies when necessary.
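The neighbor functions of a virtual topology amount to index arithmetic on task ranks. The following sketch shows this for a 2D torus; torus2d and its row-major rank numbering are assumptions made for the example and do not reproduce GOOMPI's view classes.

```cpp
#include <cassert>

// Hypothetical 2D torus view of rows*cols tasks, with ranks numbered row-major.
struct torus2d {
    int rows;
    int cols;

    // Wraps x into [0, n), also for negative x.
    static int wrap(int x, int n) { return ((x % n) + n) % n; }

    int row(int rank) const { return rank / cols; }
    int col(int rank) const { return rank % cols; }
    int rank_of(int i, int j) const { return wrap(i, rows) * cols + wrap(j, cols); }

    // Neighbor `steps` positions away in each direction, with wrap-around.
    int left(int rank, int steps = 1) const  { return rank_of(row(rank), col(rank) - steps); }
    int right(int rank, int steps = 1) const { return rank_of(row(rank), col(rank) + steps); }
    int up(int rank, int steps = 1) const    { return rank_of(row(rank) - steps, col(rank)); }
    int down(int rank, int steps = 1) const  { return rank_of(row(rank) + steps, col(rank)); }
};
```

On a 3 by 3 torus, for example, the left neighbor of rank 0 is rank 2 and its upper neighbor is rank 6, because both directions wrap around.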
Parallel I/O. Real world parallel applications often have to deal with a
large amount of data, for example, very large matrices or vectors. Storing and
retrieving such data efficiently in parallel is a practical requirement. GOOMPI
provides C++ iostream style components to support parallel I/O. Large objects
can be serialized and deserialized using GOOMPI’s parallel I/O streams.

3 Case Study: Implementing Cannon’s Algorithm for 
Matrix Multiplication with GOOMPI 

We choose to implement a classic parallel algorithm, Cannon’s algorithm
for matrix multiplication [13], to illustrate the usage of GOOMPI. To represent
matrix data structures, we make use of the Matrix Template Library (MTL)
[20], which is an excellent C++ template library that supports a variety of
matrix representations as well as much linear algebra functionality. Owing to the
extensibility of generic programming, it is convenient to integrate MTL with
GOOMPI to effectively divide and transfer various matrix data types.

The GOOMPI program exploits a master / worker paradigm. The master
scatters matrices to all worker tasks (including itself); after SPMD style
computation, the result is eventually gathered by the master. Suppose there
are P * P parallel tasks; torus_view<2> is used for representing the 2D-torus
topology of these tasks. Source code of the master is as follows:

template <typename MatrixA, typename MatrixB, typename MatrixC>
void cannon(const MatrixA& A, const MatrixB& B, MatrixC& C) {
    torus_view<2> self(P, P);            // global 2D-torus view of P*P tasks
    port<scatter> q(self);
    port<gather> r(self);

    blocked_view<MatrixA> VA(A, P, P);   // create blocked views of matrices
    blocked_view<MatrixB> VB(B, P, P);
    blocked_view<MatrixC> VC(C, P, P);

    q << VA << VB;                       // scatter VA and VB
    cannon_worker<MatrixA, MatrixB, MatrixC>();  // act as a worker
    r >> VC;                             // gather the result
}



The following is the source code of workers: 

template <typename MatrixA, typename MatrixB, typename MatrixC>
void cannon_worker() {
    torus_view<2> self(P, P);            // global 2D-torus view of P*P tasks
    port<scatter> q(self);
    port<gather> r(self);

    MatrixA a;                           // local submatrices of A and B
    MatrixB b;
    q >> a >> b;                         // receive submatrices from master
    MatrixC c(a.nrows(), a.ncols());     // local result

    port<isend_recv> s;                  // nonblocking send to avoid deadlock
    if (self.i) {                        // initial alignment
        s(self.left(self.i)) << a;
        s(self.right(self.i)) >> a;
    }
    if (self.j) {
        s(self.up(self.j)) << b;
        s(self.down(self.j)) >> b;
    }
    for (int i = 0; i < P; i++) {
        mtl::mult(a, b, c);              // c += a * b
        s(self.left()) << a;             // cyclic left shift a
        s(self.right()) >> a;
        s(self.up()) << b;               // cyclic up shift b
        s(self.down()) >> b;
    }
    r << c;                              // send back result to master
}
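The alignment and shift pattern of the worker can be checked with a sequential simulation in which each of the P * P tasks holds a single matrix element instead of a submatrix. This is an illustrative sketch only: it uses no GOOMPI or MTL code, and the names Grid and cannon_multiply are assumptions made for the example.

```cpp
#include <cassert>
#include <vector>

using Grid = std::vector<std::vector<long>>;

// Simulates Cannon's algorithm on a P x P torus where task (i, j) owns one
// element of A, B and C. Communication is modelled by index arithmetic.
Grid cannon_multiply(const Grid& A, const Grid& B) {
    const int P = static_cast<int>(A.size());
    assert(P > 0 && static_cast<int>(B.size()) == P);
    Grid a(P, std::vector<long>(P));
    Grid b = a;
    Grid c(P, std::vector<long>(P, 0));

    // Initial alignment: row i of A moves left by i, column j of B moves up by j.
    for (int i = 0; i < P; ++i)
        for (int j = 0; j < P; ++j) {
            a[i][j] = A[i][(j + i) % P];
            b[i][j] = B[(i + j) % P][j];
        }

    for (int step = 0; step < P; ++step) {
        // Local multiply-accumulate on every task: c += a * b.
        for (int i = 0; i < P; ++i)
            for (int j = 0; j < P; ++j)
                c[i][j] += a[i][j] * b[i][j];

        // Cyclic left shift of a and cyclic up shift of b: each task receives
        // from its right and lower neighbor respectively.
        Grid next_a = a;
        Grid next_b = b;
        for (int i = 0; i < P; ++i)
            for (int j = 0; j < P; ++j) {
                next_a[i][j] = a[i][(j + 1) % P];
                next_b[i][j] = b[(i + 1) % P][j];
            }
        a.swap(next_a);
        b.swap(next_b);
    }
    return c;
}
```

After the alignment, task (i, j) holds A[i][(i+j) mod P] and B[(i+j) mod P][j], so the P multiply-and-shift steps visit every term of the inner product exactly once.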



4 Performance Comparison of MPI and GOOMPI 

We tested the performance of LAM MPI [15] and GOOMPI on a 16-node PC
cluster connected with Ethernet. The result is presented in Fig. 2.

Fig. 2 (a), (b) and (c) show that when transferring arrays or std::vectors of
primitive data types or POD structs using both point-to-point and collective
communication operations (such as broadcast), there is almost no abstraction
penalty: GOOMPI performs on par with MPI. Fig. 2 (d)
suggests that GOOMPI is much faster when transferring a doubly linked list.

In fact, GOOMPI is especially good at supporting irregular dynamic
data structures. In many situations where complex data structures do not fit
well into primitive types or C arrays, GOOMPI allows more natural and
intuitive representations, while the corresponding MPI program is tedious and
clumsy. For example, it is tedious to reconstruct pointers in dynamic data
structures explicitly.



[Figure: four plots comparing LAM MPI and GOOMPI; axis label: KBytes. Panels:
(a) P2P: array/vector of int; (b) P2P: array/vector of a struct;
(c) Broadcast: array/vector of a struct; (d) P2P: doubly linked list]

Fig. 2. Communication Performance Comparison of LAM MPI and GOOMPI



5 Related Work 

MPI-2 [4] does have a C++ binding, but it consists of simple class
wrappers of MPI C functions. It does not support the full object-oriented or
generic programming paradigm.

A similar work to GOOMPI is the Object Oriented MPI (OOMPI). It is an
object-oriented approach to MPI. It supports messages composed of user-defined
types which are derived from a common base class called OOMPI_User_type, with
specific data members and default constructors defined in them. This invasive
approach implies that it is impractical to pass messages based on STL containers
or classes from other existing libraries using OOMPI.

The Generic Message Passing Framework (GMP) [16] [17] is an attempt
which follows generic programming techniques to present a single message-passing
programming model for SMP clusters. GMP provides a brand-new
message-passing library which is similar to MPI’s message-passing part, with
optimizations for communications between threads within an SMP node.

6 Conclusion and Future Work 

GOOMPI makes full use of object-oriented and generic programming techniques
to support passing messages based on user-defined dynamic data structures in
C++. It has many advantages:

Abstraction. Programmers using GOOMPI are able to pay more attention to
the algorithms and overall structure of parallel programs, instead of running
into the tedious and error-prone implementation details of message packing
/ unpacking and sending / receiving.

Extensibility. By supporting noninvasive serialization of user-defined types, it
can be used in conjunction with the C++ STL as well as other libraries (such
as MTL and the Boost Graph Library (BGL) [21]) easily.

Efficiency. By virtue of generic programming techniques, there is almost no
overhead introduced by abstraction. Programs written in GOOMPI have a
performance on par with or even better than that of their MPI counterparts
written in languages such as C or FORTRAN.

Standard Conforming. GOOMPI does not depend on language extensions.
Further, it adopts a standard iostream-style interface for message-passing. The
implementations of MPI streams and parallel I/O streams all follow the
layered design strategies of standard stream classes. Hence GOOMPI is able to
support passing messages of arbitrary size.

Type Safety. The message type checking policy in GOOMPI enables type checking
at both compile time and runtime. This reduces the possibility for programmers
to make mistakes in message-passing. GOOMPI programs are likely to be
more robust and less error-prone.

With the help of GOOMPI, programmers can map parallel algorithms into
high quality parallel programs intuitively, rapidly and effectively. GOOMPI is
also of help in the design of parallel algorithms.

Future work includes further enhancements and optimizations of GOOMPI.
We are considering building a generic library and framework based on
GOOMPI to support generic parallel and distributed data structures, and to
facilitate the parallelization of existing generic libraries such as MTL,
Boost.uBLAS [18] and Blitz++ [19].

We also plan to integrate GOOMPI with a thread-level lightweight parallel
library called Parallel Multi-Thread Library (PMT), developed by the authors,
to provide both a unified message-passing model and a unified data-parallel
model for aiding parallel algorithm design and implementation on different
parallel architectures.
Abstract. This paper describes a rule-based generic programming and
simulation paradigm for conventional hard computing and for soft and innovative
computing, e.g., dynamical, genetic, nature-inspired self-organized criticality
and swarm intelligence. The computations are interpreted as the outcome
arising out of deterministic, non-deterministic or stochastic interaction among
elements in a multiset object space that includes the environment. These
interactions are like chemical reactions, and the evolution of the multiset can
mimic the evolution of the complex system. Since the reaction rules are
inherently parallel, any number of actions can be performed cooperatively or
competitively among the subsets of elements. This paradigm permits carrying
out parts or all of the computations independently in a distributed manner on
distinct processors and is eminently suitable for cluster and grid computing.



1 Introduction 

Most systems we observe in nature are complex dynamical systems that consist of a
large number of degrees of freedom. They may contain several inhomogeneous
subsystems that are spatially and temporally structured on different scales and
characterized by their own dynamics. Such complex systems often exhibit collective
(“Emergence”) behaviour that is difficult to model. Stochastic and chaotic dynamical
systems provide an efficient methodology for modelling and simulation of complex
systems by capturing the behaviour of the system at different spatial and temporal
scales. The simulation based approach of a stochastic or chaotic dynamical system
can be viewed as “soft computation”, since unlike in conventional computation where
exactness is our goal, we allow for the possibility of error and randomness.

This paper describes a generic multiset programming paradigm for the simulation of
complex systems. This paradigm permits us to write a generic program [2], [3], called
a program shell, that implements the common control structure. It includes a few
unspecified data types and procedures that vary from one application to another.
Hence, this Unified Multiset Simulation Paradigm (UMSP) can be used for all
conventional algorithms [15], Tabu search, Markov chain Monte Carlo (MCMC),
Particle Filters [6], Evolutionary algorithms (classifier systems, bucket brigade
learning, Genetic algorithms and Programming) [14], Immunocomputing,
Self-organized criticality [4], Active Walker models (ants with scent, or the
multiwalker paradigm where each walker can influence the other through a shared
landscape based on probabilistic selection) [9], and Biomimicry [16]. It is also
applicable to non-equilibrium systems using oscillatory mechanisms involving
catalytic reactions, as for example the production of ATP (adenosine triphosphate)
from ADP [8].

Structure of Unified Multiset Simulation Paradigm (UMSP) 

The UMSP has the following features: 

(i) One or more object spaces (multisets) that contain elements whose information is 
structured in an appropriate way to suit the problem at hand. 

(ii) A set of interaction rules that prescribes the context for the applicability of the 
rules to the elements of an object space. Each rule consists of a left-hand side (a 
pattern or attribute) of named objects and the conditions under which they interact, 
and a right hand side that describes the actions to be performed on the elements of the 
object space, if the rule becomes applicable based on some deterministic or 
probabilistic criteria. 

(iii) A control strategy that specifies the manner in which the elements of the object 
space will be chosen and interaction rules will be applied, the kinetics of the rule- 
interference (inhibition, activation, diffusion, chemotaxis) and a way of resolving 
conflicts that may arise when several rules match at once. 

(iv) A mechanism to evaluate the elements of the object space to determine the 
effectiveness of rule application (e.g., evaluating fitness for survival). 

Thus, UMSP provides a stochastic framework of "generate and test" for a wide 
range of problems; see Yao [19], and Michalewicz and Fogel [14]. The system 
structure of UMSP, consisting of components and their interactions, is also supported by 
contemporary software architecture design [2]. 

Computational Features of UMSP 

The UMSP has the following computational features: 

(i) Interaction-based: The computations are interpreted as the outcome of interacting 
elements of the object space that produce new elements (or same elements with 
modified attributes) according to specific rules. Hence the intrinsic (genotype) and 
acquired properties due to interaction (phenotype) can both be incorporated in the 
object space. Since the interaction rules are inherently parallel, any number of actions 
can be performed cooperatively or competitively among the subsets of elements, so 
that the new elements evolve toward an equilibrium or unstable or chaotic state. 

(ii) Content-based rule activation: The next set of rules to be invoked is determined 
solely by the contents of the object space as in chemical reactions. 

(iii) Pattern matching: Search takes place to bind the variables in such a way to satisfy 
the left hand side of the rule. This characteristic of pattern (or attribute) matching 
makes the UMSP suitable for innovative computing. 

(iv) Suitable for deterministic, non-deterministic and probabilistic modes. 

(v) Choice of objects, and actions: We can use strings, arrays, sets, trees and graphs, 
multisets, tuples, molecules, particles and even points, as the basic elements of 
computation and perform suitable actions on them by defining a suitable topology, 
geometry or a metric space. 

Sections 2 and 3 describe the general properties of rule-based paradigms. 
Section 4 gives examples of UMSP. Section 5 contains the conclusion. 



2 Rule-Based Programming Paradigm 

Specification: 

The main feature of the rule-based paradigm is the specification of the program: 

G(R, A)(M) = if there exist elements a, b, c, ... in an object space M such that an 
interaction rule R(a, b, c, ...) involving the elements a, b, c, ... is applicable, then 
G(R, A)((M - {a, b, c, ...}) + A(a, b, c, ...)), else M. 

Here M denotes the initial object space with components of appropriately chosen 
data type. This is a multiset or a bag in which a member can have multiple 
occurrences, Calude et al. [5]. The operator - denotes the removal (annihilation) of the 
interacted elements; it is the multiset difference; the operator + denotes the insertion 
(or creation) of new elements after the action A; this is multiset union of appropriately 
typed components. Note that R is a condition text (or interaction condition that is a 
boolean) that is used to check when some of the elements of the object space M can 
interact. The function A is the action text that describes the result of this interaction. 
Note that both R and A are exact and deterministic. Testing for R involves a 
deterministic search, and evaluation of truth or falsity of Boolean predicates. Also 
actions performed in A are assumed to be exact. 
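
The specification above can be illustrated with a small deterministic sketch in Python. It is not the authors' implementation: the function name `gamma`, the restriction to rules over pairs of elements, and the prime-sieve rule are assumptions chosen for illustration. The multiset M is held in a `collections.Counter`, R is a Boolean condition, and A returns the elements to insert.

```python
from collections import Counter
from itertools import permutations


def gamma(multiset, R, A):
    """Apply G(R, A)(M) for binary rules until no pair of elements satisfies R."""
    m = Counter(multiset)
    while True:
        # Deterministic search over ordered pairs of distinct occurrences.
        for a, b in permutations(list(m.elements()), 2):
            if R(a, b):
                m.subtract([a, b])   # M - {a, b}: remove the interacting elements
                m.update(A(a, b))    # ... + A(a, b): insert the produced elements
                m = +m               # drop entries whose count fell to zero
                break
        else:
            return m                 # no rule applicable: return M unchanged


# Hypothetical example rule: if a divides b, keep a and discard b.
primes = gamma(range(2, 21), R=lambda a, b: b % a == 0, A=lambda a, b: [a])
print(sorted(primes.elements()))
```

With this rule the fixed point of the computation is the multiset of primes up to 20, because every composite element is eventually consumed by one of its divisors.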

The function R can be interpreted as the query evaluation function in a database M 
and the function A can be interpreted as the updating function for a set of database 
instances. Hence, if one or several interaction conditions hold for several non-disjoint 
subsets of the object space at the same time, the choice made among them can be 
nondeterministic or probabilistic. This leads to competitive parallelism. The 
actions on the chosen subset are then executed atomically and committed. That is, the 
chosen subset undergoes an 'asynchronous atomic update'. This ensures that the 
process of matching and the follow-up actions satisfy the four important properties 
used in Transaction Processing [13], namely the ACID properties: Atomicity 
(indivisibility: either all or no actions are carried out), Consistency (before and 
after the execution of a transaction), Isolation (no interference among the actions) 
and Durability (no failure). Once all the actions are carried out and committed, the next set 
of conditions is considered. As a result of the actions followed by commitment, we 
derive a new database; this may satisfy new conditions of the text, and the actions are 
repeated by initiating a new set of transactions. This set of transformations halts 
when there are no more executable transactions or the database does not undergo a 
change for two consecutive steps, indicating a new consistent state of the database. 

However, if the interaction condition holds for several disjoint subsets of elements 
in the database at the same time, the actions can take place independently and 
simultaneously. This leads to cooperative parallelism. 

Deterministic and Nondeterministic Iterative Computation: 

This consists of applications of rules that consume the interacting elements of the 
object space and produce new or modified elements in the multiset. This is essentially 
Dijkstra’s guarded command program. It is well known that the guarded command 
approach serves as a universal distributed programming paradigm for all conventional 
algorithms with deterministic or nondeterministic components [10], so we will not 
elaborate on this aspect any further. 

Termination: For the termination of rule application, the interaction conditions R 
have to be designed so that the elements in the object space can interact only if they 
are in opposition to the required termination condition. When all the elements 
meet the termination condition, the rules are no longer applicable and the computation 
halts, leaving the object space in an equilibrium state (or a fixed point). 

Non-termination, instability, chaos: These cases arise when the rules continue to 
fire indefinitely, as in chemical oscillations. Then the object space can be in a 
non-equilibrium state. It is also possible that the evolution leads to instability and chaos of 
the deterministic iterative dynamics. 

For example, consider the rule-based iterative dynamical system: For X(0) in the 
range [-1,1], if X(i) > 0 then G(X(i+1)) = -2X(i) + 1; else G(X(i+1)) = 2X(i) + 1. The 
rules X(i) > 0 and X(i) < 0 are mutually exclusive and non-competitive; they generate a 
chaotic dynamical system, unstable, having a dense orbit in [-1,1]. 
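
A minimal Python sketch of this two-rule system follows, reading G(X(i+1)) as the next state X(i+1); that reading and the helper names `step` and `orbit` are assumptions. Exact rational arithmetic is used because repeated doubling in binary floating point loses one bit per step, so a floating-point orbit collapses onto the fixed point -1 after a few dozen iterations.

```python
from fractions import Fraction


def step(x):
    """Apply whichever of the two mutually exclusive rules matches x."""
    if x > 0:
        return -2 * x + 1
    return 2 * x + 1


def orbit(x0, n):
    """Return [X(0), X(1), ..., X(n)]."""
    xs = [x0]
    for _ in range(n):
        xs.append(step(xs[-1]))
    return xs


xs = orbit(Fraction(1, 7), 17)                       # a period-3 orbit
ys = orbit(Fraction(1, 7) + Fraction(1, 10**6), 17)  # a nearby starting point
print(xs[:4])
print(float(abs(xs[17] - ys[17])))
```

The starting point 1/7 returns to itself after three steps, while the initial separation of 10^-6 doubles at every step shown, which illustrates the instability of the iteration.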



3 Stochastic Rule-Based Paradigm 

The introduction of a stochastic mechanism (randomness) in a rule-based system has 
several advantages: 

(i) It provides ergodicity of search orbits. This property ensures that searching is done 
through all possible states of the solution space, since there is a finite probability that 
an individual can reach any point in the problem space with one jump. 

(ii) It provides solution discovery capabilities (as in genetic programming) and 
enables us to seek a global optimum rather than a local optimum. 

(iii) It cuts down the average running time of an algorithm whose running time would 
otherwise be worst-case. We achieve this gain by producing an output having an error 
with a small probability. 

(iv) It is applicable to problems in many disciplines: Genetics (genetic algorithms), 
Thermodynamics (simulated annealing), Statistical Mechanics (particle transport) and 
Complex Systems (active-walker, self-organization and percolation models). 

The unified multiset rule-based simulation paradigm (UMSP) is obtained from the 
rule-based system described in Section 2 by introducing probabilities for selection, to 
test whether one or more reaction conditions hold for several non-disjoint subsets at 
the same time. In this case, the choice made among these subsets is determined by a 
random number generator, which randomly selects the elements of the multiset with a 
probability p; the reaction conditions are then tested and the required actions performed. 
UMSP is defined by the function: 

PG(R(p(i)), A)(M) = if there exist elements a, b, c, ... belonging to an object space M 
(a multiset) such that R(a, b, c, ...) then G(R, A)((M - {a, b, c, ...}) + A(a, b, c, ...)) else M, 
where each of the possible subsets i that satisfy the conditions R is 
randomly chosen with an appropriate probability p(i), the corresponding text of 
action A is implemented, and the components of the multiset are updated 
appropriately. Further, if p(i) is not specified in a component program, the choice 
can be deterministic or nondeterministic. Thus a composite program can contain 
within itself deterministic, nondeterministic and probabilistic components. 

The implementation of UMSP consists of the following four basic steps: 

Step 0: Initialization: Initializing the multiset representing the problem domain. 

Step 1: Search: Deterministic or random searching for the candidate elements that 
satisfy a given rule (interaction condition) exactly or within a probabilistic bound. 

Step 2: Rule Application: Carrying out the appropriate actions on the chosen 
elements as dictated by the given rule. 

Step 3: Stopping: It is typical in probabilistic methods not to explicitly state a 
stopping criterion. A key reason for this is that the convergence theory can provide 
only asymptotic estimates, as the number of iterations goes to infinity. However, in 
practice, we need to choose a suitable stopping criterion for the given problem; 
otherwise, we may waste resources. 
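
The four steps can be sketched as a small Python program shell. This is an illustrative sketch under stated assumptions, not the Multran realisation: the name `umsp`, uniform random selection of candidate elements, and a fixed trial budget as the stopping criterion are all choices made here for brevity.

```python
import random
from collections import Counter


def umsp(multiset, condition, action, arity=2, max_trials=10_000, seed=0):
    """Program shell: random search, rule application, budget-based stopping."""
    rng = random.Random(seed)
    m = Counter(multiset)                    # Step 0: initialise the object space
    for _ in range(max_trials):              # Step 3: stop after a fixed budget
        elements = list(m.elements())
        if len(elements) < arity:
            break                            # too few elements left to interact
        chosen = rng.sample(elements, arity) # Step 1: random search for candidates
        if condition(*chosen):               # test the interaction condition R
            m.subtract(chosen)               # Step 2: consume the chosen elements
            m.update(action(*chosen))        #         and insert the products A(...)
            m = +m                           # drop entries whose count reached zero
    return m


# Hypothetical component program: keep the larger element of any ordered pair.
result = umsp([3, 9, 2, 7, 9, 1],
              condition=lambda a, b: a <= b,
              action=lambda a, b: [b])
print(result)
```

Only `condition` and `action` change from one application to another; the control structure stays fixed, which is the sense in which the shell is generic.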

In Step 3, we may use various acceptance criteria; these may involve evaluating 
an individual element, a selected subset or the whole object space; that is, the 
evaluation of the object space can take place at different levels of granularity 
depending upon the problem domain. Also, the acceptance criteria may be chosen 
to be dependent on or independent of the number of previous trials, and the choice of 
probabilities can remain static or can vary dynamically with each trial. Thus, 
depending upon the evaluation granularity, the acceptance criteria and the manner in 
which the probability assignments are made, we can devise various strategies by 
suitably modifying the skeletal structure of UMSP. For example, one may choose to 
select a reaction rule from a rule base probabilistically or vary the frequency of 
application of competing rules. One may also carry out any operation 
probabilistically. UMSP is suitable for optimizing the structure of the model used, as in 
Genetic Programming, or for optimizing its parameters, as in Genetic algorithms. 



4 Examples for Realisation of UMSP 

Practical realisation of the UMSP and application to many different types of 
algorithms can be achieved through a coordination programming language, Multran, 
using Multiset and the concept of transactions [13]. Also UMSP can be implemented 
in the grid and cluster-computing environment using MPI [7,17]. Due to lack of space 
we can give only two examples. 

(i) Swarm and Ant Colony Paradigm 

A swarm (flock of birds, ants, cellular automata) is a population of interacting 
elements that can optimize some global objective through cooperative search of space 
[9]. Here, individual elements in the multiset are points in space, and change over 
time is represented as movement of points, representing particles with velocities and 
the system dynamics is formulated in UMSP using the rules: 

1. Stepping rule: The state of each individual element is updated in many 
dimensions, in parallel, so that the new state reflects each element’s previous best 
success; e.g., the position and momentum (velocity) of each particle. 

2. Landscaping rule: Each element is assigned a new best value of its state that 
depends on its past best value and a suitable function of the best values of its 
interacting neighbours, with a suitably defined neighbourhood topology and 
geometry. 

Rule 1 reflects the betterment of the individual, while rule 2 reflects the betterment of 
the collection of individuals in the neighbourhood as a whole, by evaluating the 
relevance of each individual and providing support for its activity. These two rules 
permit us to model Markovian random walks, which are independent of the past history 
of the walk, and non-Markovian random walks, which depend upon past history, such as 
self-avoiding, self-repelling and active random-walker models. This can result in a 
swarm (a self-organizing system) whose global nonlinear dynamics emerges from 
local rules due to stochasticity or chaos introduced by the parameter variation. Also, 
interesting new properties may show up: low-dimensional attractors, bifurcations and 
chaos, and various kinds of attractors having fractal dimensions that present swarm-like 
or flock-like appearances depending upon the Jacobian of the mapping; see Wolfram 
[18]. 
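
As a hedged illustration, the two rules can be read as the velocity/position update and the best-value update of a standard particle swarm optimiser. That reading is an assumption, as are the function name `swarm_minimize`, the whole swarm acting as a single neighbourhood, the coefficients 0.7 and 1.5, and the test objective `sphere`.

```python
import random


def sphere(x):
    """Hypothetical objective to minimise: sum of squares."""
    return sum(v * v for v in x)


def swarm_minimize(f, dim=2, n_particles=10, steps=50, seed=0):
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best = [p[:] for p in pos]      # each element's own best state so far
    g = min(best, key=f)[:]         # best state in the neighbourhood (whole swarm)
    history = [f(g)]
    for _ in range(steps):
        for i in range(n_particles):
            # Stepping rule: update momentum and position in every dimension.
            for d in range(dim):
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * rng.random() * (best[i][d] - pos[i][d])
                             + 1.5 * rng.random() * (g[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            # Landscaping rule (individual part): remember a better own state.
            if f(pos[i]) < f(best[i]):
                best[i] = pos[i][:]
        # Landscaping rule (neighbourhood part): share the best value found.
        g = min(best + [g], key=f)[:]
        history.append(f(g))
    return g, history


g, history = swarm_minimize(sphere)
print(history[0], history[-1])
```

The recorded best value can never get worse from one step to the next, because the previous best is always among the candidates for the new one.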

(ii) Discrete Adaptive Stochastic Optimization 

Consider the following discrete stochastic optimization problem. Let 
$\Theta = \{1, 2, \ldots, S\}$ denote a finite set and consider the following problem: Compute 

$\theta^* = \arg\min_{\theta \in \Theta} E\{X_n(\theta)\}$ 

where $E$ denotes mathematical expectation and, for any fixed $\theta$, $\{X_n(\theta)\}$ 
denotes a sequence of independent and identically distributed (iid) random variables 
that can be generated for any choice of $\theta \in \Theta$. If the density function of $X_n(\theta)$ is not 
known, it is not possible to analytically evaluate the above expectation and hence $\theta^*$. 
In such a case, one needs to resort to simulation-based stochastic approximation to 
compute the optimal solution $\theta^*$. 

A brute force approach to computing the optimal solution to the problem involves 
exhaustive enumeration over all of $\Theta$ and proceeds as follows: For each $\theta \in \Theta$, 
generate a large number N of random samples $X_1(\theta), X_2(\theta), \ldots, X_N(\theta)$. Then 
compute an estimate of $E\{X_n(\theta)\}$ using the sample average (arithmetic mean) 

$G_N(\theta) = (X_1(\theta) + X_2(\theta) + \cdots + X_N(\theta)) / N.$ 

By Kolmogorov’s strong law of large numbers (which is one of the most fundamental 
consequences of the ergodic theorem for iid processes), $G_N(\theta) \to E\{X_n(\theta)\}$ with 
probability one as $N \to \infty$. This and the finiteness of $\Theta$ imply that 

$\arg\min_{\theta \in \Theta} G_N(\theta) \to \arg\min_{\theta \in \Theta} E\{X_n(\theta)\}$ as $N \to \infty$. 

However, the above brute force procedure is inefficient: evaluating $G_N(\theta)$ at values 
$\theta$ with $\theta \neq \theta^*$ is wasted effort, since it contributes nothing towards evaluating 
$G_N(\theta^*)$. What is required is an intelligent dynamic scheduling (search) scheme that 
decides at each time instant which value of $\theta$ to evaluate next, given the current 
estimates, in order to converge to the optimum $\theta^*$ with minimum effort. 
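
The brute force procedure can be sketched in Python as follows. The sampler `draw`, its Gaussian noise model and the table `TRUE_MEANS` are hypothetical stand-ins for a simulation whose expectation is not known analytically; the problem is treated as a minimisation.

```python
import random

# Hypothetical ground truth, used only to generate noisy samples X_n(theta).
TRUE_MEANS = {1: 4.0, 2: 2.5, 3: 1.0, 4: 3.0, 5: 5.0}
rng = random.Random(0)


def draw(theta):
    """One noisy sample X_n(theta): true mean plus unit-variance Gaussian noise."""
    return TRUE_MEANS[theta] + rng.gauss(0.0, 1.0)


def sample_average(theta, n):
    """G_N(theta): arithmetic mean of n independent samples."""
    return sum(draw(theta) for _ in range(n)) / n


def brute_force_argmin(thetas, n):
    """Estimate every candidate with n samples each, then pick the smallest."""
    estimates = {theta: sample_average(theta, n) for theta in thetas}
    return min(estimates, key=estimates.get), estimates


best, estimates = brute_force_argmin(TRUE_MEANS, n=2000)
print(best, round(estimates[best], 2))
```

Every candidate receives the same N samples here, so four fifths of the 10000 draws are spent on values other than the optimum; this is the wasted effort the text refers to.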

Here we present a globally convergent discrete stochastic approximation algorithm 
based on the random search procedures [1,11,12,20]. The basic idea is to generate a 
homogeneous Markov chain taking values in $\Theta$ which spends more time at the global 
optimum than at any other element of $\Theta$. This consists of the following skeletal 
structural steps of UMSP. 

Step 0: Initialization: At time n = 0, select a starting point $\theta_0 \in \Theta$ randomly with 
uniform probability. Set $D_0 = e_{\theta_0}$, where $e_i$ denotes the S-dimensional unit vector 
with 1 in the i-th position and zeros elsewhere. Set the initial solution estimate $\hat{\theta}_0 = \theta_0$. 

Step 1: Search (Random Sampling): At time n, sample $u_n \in \Theta - \{\theta_n\}$ with uniform 
distribution. Evaluate the random sample costs $X_n(\theta_n)$ and $X_n(u_n)$. 

Step 2: Rule Application: If $X_n(u_n) < X_n(\theta_n)$ then set $\theta_{n+1} = u_n$, else set $\theta_{n+1} = \theta_n$. 
Update the duration time vector at time n+1 as $D_{n+1} = D_n + e_{\theta_{n+1}}$. 
Update the estimate of the optimum at time n as $\hat{\theta}_n = \arg\max_{i \in \{1, 2, \ldots, S\}} D_{n+1}(i)$. 

Step 3: Stopping: Choose stopping criteria appropriately; if not satisfied, set 
n to n + 1 and go to Step 1. 

Then, as proved in [1], under suitable conditions (e.g., if the density function with 
respect to which the expected value is defined above is symmetric), the estimate 
$\hat{\theta}_n$ generated by the above random search stochastic approximation algorithm 
converges with probability one to the global optimum $\theta^*$. It is also shown in [1] that 
the algorithm is attracted to the global optimum, i.e., the algorithm spends more time 
at the global optimum than at any other candidate value. That is, for sufficiently large n, 
the duration time vector $D_n$ has its maximum element at $\theta^*$. 
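
A compact Python sketch of such a random search is given below. It assumes a minimisation problem, a move to the sampled candidate whenever its noisy cost is lower, 0-based indices in place of 1, ..., S, and a fixed number of iterations as the stopping criterion; `TRUE_MEANS` and `draw` are hypothetical.

```python
import random

TRUE_MEANS = [4.0, 2.5, 1.0, 3.0, 5.0]   # hypothetical; the optimum is index 2
rng = random.Random(1)


def draw(theta):
    """Noisy cost sample X_n(theta)."""
    return TRUE_MEANS[theta] + rng.gauss(0.0, 0.5)


def random_search(num_states, steps):
    theta = rng.randrange(num_states)     # Step 0: uniform random starting point
    duration = [0] * num_states           # D_n: visits to each candidate
    duration[theta] = 1
    estimate = theta
    for _ in range(steps):                # Step 3: fixed iteration budget
        # Step 1: sample a different candidate uniformly and evaluate both costs.
        u = rng.choice([t for t in range(num_states) if t != theta])
        # Step 2: move if the candidate's sampled cost is lower.
        if draw(u) < draw(theta):
            theta = u
        duration[theta] += 1
        # The estimate is the most frequently visited candidate.
        estimate = max(range(num_states), key=duration.__getitem__)
    return estimate, duration


estimate, duration = random_search(len(TRUE_MEANS), steps=5000)
print(estimate, duration)
```

In contrast to exhaustive enumeration, most cost evaluations end up being made at or near the optimum, since the chain rarely leaves it once it arrives.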

The above algorithm has several applications. It can be used to learn the behaviour 
of an ion channel (large protein molecule) in a nerve cell membrane to estimate the 
Nernst potential efficiently [11]. In [12,20] its recursive version optimizes the 
spreading code of a CDMA spread spectrum transmitter over a fading wireless 
channel. 



5 Conclusion 

The introduction of stochastic/chaotic mechanisms in a multiset chemical reaction 
model provides a soft-computational model to study evolutionary biological, chemical 
and physical systems interacting with the environment. The paradigm described here 
provides a new environment using a distributed architecture for swarm intelligence, 
membrane and bio-immunology computing, adaptive stochastic optimisation and 
self-organized criticality. This simulation paradigm is well-suited for cluster and grid 
computing using MPI. 



Abstract. In embedded software development environments, developers 
can concurrently debug a running process and its child processes 
only by using multiple gdbs and gdbservers. This requires additional coding 
and the tedious work of starting an additional gdb and gdbserver for each 
created process. In this paper, we propose an efficient mechanism for 
concurrent debugging of multiple remote processes in embedded system 
environments, using a library wrapping mechanism that requires no Linux 
kernel modification. Through an experiment in debugging two processes 
that communicate through an unnamed pipe in the target system, we show 
that our proposed debugging mechanism is easier and more efficient than 
preexisting mechanisms. 



1 Introduction 

Currently, gdb is widely used as a remote debugging tool in embedded Linux 
software development. By running gdb in the host system and gdbserver 
in the target system, developers can debug a remote 
process running in the target system [1][2]. However, developers must insert a 
“sleep” call into the debugged program in order to concurrently debug a 
newly created child process of the currently debugged process. Developers also 
need an additional gdbserver in the target system, connected to the blocked 
child process. In the host system, an additional gdb is required to connect to the 
new gdbserver in the target system. Therefore, developers must have as many 
gdbs in the host system and gdbservers in the target system as there are 
debugged processes. 

A gdbserver in the target system lets developers debug a process by using 
the ptrace system call in Linux. However, the ptrace system call needs a 
parent-child relationship between the gdbserver and the debugged process [3]. 
When a debugged process creates a new process, no parent-child relationship 
is established between the gdbserver and the newly created child process. 
Developers therefore need to insert sleep code into the code of the newly 
created child process. When the newly created process is blocked by the 
sleep code in the target system, developers run a new gdbserver in the target 
system and connect it to the blocked process. Developers also run a new gdb in 
the host system and connect it to the new gdbserver in the target system. Once 
both connections are established, developers can debug the newly created process 
in the target system by using the gdb in the host system and the gdbserver in 
the target system. 

* This work was supported by Korea Research Foundation Grant 
(KRF-2003-041-D20420). 

2 Our Proposed Mechanism 

In this paper, we propose a new debugging mechanism that supports concurrent 
debugging of multiple remote processes by using the mgdb library and the 
mgdbserver. Fig. 1 shows an overview of the proposed mechanism. The 
mgdbserver in the target system communicates with the gdb in the host system. 
Developers can concurrently debug multiple remote processes by using the 
mgdbserver to select, at any desired time, the process they intend to debug. 
Whenever a debugged process creates a new child process, the mgdbserver 
automatically runs a new gdbserver in the target system and connects it to the 
newly created child process in order to support concurrent debugging of that 
process. 



Fig. 1. Overview of our concurrent and remote debugging mechanism 



In order to support concurrent debugging of multiple remote processes, the 
mgdbserver must know when the currently debugged process invokes the fork 
system call. In this paper, we wrap the glibc library in order to intercept the 
system calls that the currently debugged process invokes. When the currently 
debugged process calls a function in the glibc library, the library wrapping 
scheme intercepts the call and calls the function of the same name in our mgdb 
library. The called function in our mgdb library executes the code needed for 
debugging multiple processes before calling the glibc function that was 
originally intended to be called. In order to intercept system calls, we use the 
interposition mechanism of the Linux dynamic linker [4], preloading our mgdb 
library before the glibc library. 

Fig. 2. Flow of debugging newly created process after mgdbserver receives signal 



When the currently debugged process creates a new child process, our mgdb 
library blocks the newly created process in order to prevent it from 
terminating. It also sends a signal to inform the mgdbserver that the currently 
debugged process has created a new child process. The mgdbserver then runs a 
new gdbserver and connects it to the newly created process. As shown in Fig. 2, 
when the mgdbserver receives the signal from our mgdb library, it creates a new 
gdbserver. 

In order to become the parent process of the newly created process, the 
new gdbserver sets the PT_PTRACED value in the ptrace variable of the newly 
created process by invoking the ptrace system call. When developers want to change 
the currently debugged process, they can select the process to debug by using the 
gdb in the host system. When a new process to debug is selected in gdb, 
the mgdbserver passes debugging requests from the gdb to the gdbserver. 

3 Experiment and Performance Analysis 

The scenario used in the experiment is as follows. The parent process creates a 
child process and sends a string through the unnamed pipe shared by the child 
process. After the child process receives the string from the parent process, it 
rearranges the string in reverse order and sends the string to the parent process 
through the unnamed pipe. After the parent process receives the string, it prints 
the string sent by the child process. 
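
The scenario can be sketched in Python as follows. This is only an illustration of the parent/child exchange, not the test program used in the experiment: the function name `reverse_via_child` is hypothetical, and two unnamed pipes are assumed (one per direction), since an unnamed pipe is conventionally used in one direction only. It needs a POSIX system because it relies on `os.fork`.

```python
import os


def reverse_via_child(text):
    """Send text to a forked child over a pipe and read back the reversed string."""
    to_child_r, to_child_w = os.pipe()
    to_parent_r, to_parent_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: read everything the parent wrote, reverse it, write it back.
        try:
            os.close(to_child_w)
            os.close(to_parent_r)
            data = b""
            while chunk := os.read(to_child_r, 4096):
                data += chunk
            os.write(to_parent_w, data.decode()[::-1].encode())
        finally:
            os._exit(0)   # never fall through into the parent's code path
    # Parent: close the ends it does not use, send the string, await the reply.
    os.close(to_child_r)
    os.close(to_parent_w)
    os.write(to_child_w, text.encode())
    os.close(to_child_w)  # end-of-file tells the child the message is complete
    reply = b""
    while chunk := os.read(to_parent_r, 4096):
        reply += chunk
    os.close(to_parent_r)
    os.waitpid(pid, 0)    # reap the child
    return reply.decode()


print(reverse_via_child("hello"))
```

Debugging a program of this shape is where the difficulty arises: the child exists only after the fork call, so a debugger attached to the parent has no control over it unless the child is stopped early.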

Developers can see the debugging status of all processes created 
by the currently debugged process by typing “show-remoted-debugee” at the 
“gdb” prompt in the host system. By selecting a process identifier, 
they can also switch to a specific process for debugging through the 
“change-remote-debugee” command with a “pid” argument. 



Table 1. Comparison of TotalView with our mgdb library and mgdbserver 

                             ETNUS's TotalView    Our proposed debugger 
Test program image size      53658 bytes          23973 bytes 
Library linking mechanism    static               dynamic 
Remote debugging             not supported        supported 


In this experiment, we focus on the ability of our debugger tool to support 
concurrent and remote debugging of the parent process and the newly created 
child process by selecting the process to debug using only one gdb in the 
host system. As shown in Table 1, we compare our proposed scheme with the ETNUS 
TotalView program, which supports debugging of multiple processes [5]. However, 
ETNUS TotalView supports only static library linking, 
so the debugged program image in ETNUS TotalView is larger than 
that in our proposed scheme. ETNUS TotalView also does not support remote 
debugging. 

4 Conclusion 

In this paper, we presented a new concurrent debugging mechanism for remote 
processes through the design and implementation of the mgdb library and the 
mgdbserver. In our proposed scheme, developers can debug all debugged pro- 
cesses in the target system by selecting the debugged process among them 
through one gdb in the host system. Compared with the preexisting mechanism, 
our proposed scheme provides easier and more efficient concurrent debugging for 
multiple remote processes in the target system. 



Abstract. The efficiency of a large-scale parallel computer is critically dependent 
on the performance of its interconnection network. Analytical modelling 
plays an important role in obtaining a clear understanding of network 
performance under various design spaces. This paper proposes an analytical 
performance model for circuit-switched hypercubes in the presence of multiple 
time-scale correlated traffic, which can appear in many parallel computation 
environments and has a strong impact on network performance. The tractability and 
reasonable accuracy of the analytical model, demonstrated by simulation 
experiments, make it a practical and cost-effective evaluation tool for investigating 
network performance under different system configurations. 



1 Introduction 

Multicomputers have been widely accepted as the solution for grand challenge 
problems in high performance computing. The interconnection network [1] is a 
critical architectural component in multicomputer systems, as any interaction between 
the processors ultimately depends on its effectiveness. The hypercube has been one of 
the popular network topologies in multicomputers owing to its desirable properties, 
such as regular structure, symmetry, low diameter and high connectivity to deal with 
fault-tolerance [2]. An n-dimensional hypercube has $N = 2^n$ nodes with 2 nodes in 
each dimension. Each node consists of a processing element and a router. 

The switching strategy determines how data in a message traverses its route from 
source to destination. Circuit switching has been widely employed in computer 
and telecommunication systems [1]~[7]. Such a switching strategy is divided into two 
phases: (1) a circuit establishment phase; (2) a transmission phase. A dedicated path is 
set up prior to the transmission of data. A noticeable advantage of circuit switching is 
that it does not require packetizing. Moreover, the low buffer requirement 
enables the construction of small, compact, and fast routers [1]. 

Traffic loads generated by real-world applications have very strong effects on the 
performance of interconnection networks. Many recent studies [8]~[10] have 
demonstrated that realistic traffic can reveal burstiness and correlations among inter-arrival 
intervals over a number of time scales. At every time scale, traffic bursts consist of 
bursty subperiods separated by less bursty subperiods. This fractal-like behaviour of 
network traffic can be much better modelled using statistically long-range dependent 
processes, which have totally different theoretical properties from the conventional 
Poisson process [9]. A stochastic process X with autocorrelation function $r(k)$ is 
long-range dependent if its autocorrelation decays hyperbolically, i.e. 
$r(k) \sim |k|^{-\beta}$ as $|k| \to \infty$ with $0 < \beta < 1$ [11]. The Hurst parameter, $H = 1 - \beta/2$ 
where $0.5 < H < 1$, is commonly used to measure the degree of long-range 
dependence. 

Sahuquillo et al. [10] have traced some typical parallel applications and revealed 
that workloads generated by many scientific and engineering computations exhibit this 
fractal-like nature. In an effort towards providing cost-effective tools that help 
investigate network performance with various design alternatives and under different 
traffic conditions, this paper proposes an analytical model for hypercube networks 
with circuit switching in the presence of multiple time-scale bursty and correlated 
traffic. The validity of the model is demonstrated by comparing analytical results to 
those obtained through simulation experiments of the actual system. 

The rest of this paper is organized as follows. Section 2 presents the derivation of 
the analytical model. Section 3 validates the model through simulation experiments. 
Finally, Section 4 concludes this study. 



2 Derivation of the Performance Model 

The analytical model is based on the following assumptions [2], [4]~[7], [12], [13]. 

1) Traffic generated by each node follows an independent stochastic process with a 
mean arrival rate T and autocorrelation at lag 1 r(l) = pX . Traffic burstiness and 
correlations appear over t time scales. The autocorrelation decays hyperbolically 
with Hurst parameter H as the time scale increases. 

2) Message destination nodes are uniformly distributed across the network nodes. 
Message length is M flits. 

3) The local queue in the source node has infinite capacity. Each physical channel is 
divided into V virtual channels [1]. 

4) Messages are routed adaptively through the network [1,6] using one of the avail- 
able shortest paths. 

Under the uniform traffic pattern, the average message distance in an 
n-dimensional hypercube is given by $D = n/2$ [2]. As the network latency consists of 
the time to establish a path and the time to transmit a message, it can be calculated as 
$T = E + D + M$, where E represents the path set-up time. Since the Laplace-Stieltjes 
transform (LST) of the sum of independent random variables is equal to the product 
of their transforms [14], the LST of T can be written as 

$T^*(s) = E^*(s)\, e^{-s(D+M)}$    (1) 

where $E^*(s)$ denotes the LST of the time to set up a path. 

Following the approach proposed in Ref. [15], traffic burstiness and correlations
over multiple time scales can be modelled by the superposition of $L$ two-state
Markov-Modulated Poisson Processes (MMPPs) [16], typically $L = 4$. We use
MMPP$^{(i)}$, with superscript $i$, to denote the $i$-th MMPP ($1 \le i \le L$). A two-state
MMPP$^{(i)}$ can be parameterised by the infinitesimal generator $A_i$ and rate matrix
$B_i$ as [18]

$$A_i = \begin{bmatrix} -\sigma_{1i} & \sigma_{1i} \\ \sigma_{2i} & -\sigma_{2i} \end{bmatrix} \quad\text{and}\quad B_i = \begin{bmatrix} \lambda_{1i} & 0 \\ 0 & \lambda_{2i} \end{bmatrix} \qquad (2)$$



The element $\sigma_{1i}$ is the transition rate from state 1 to 2 of the MMPP$^{(i)}$ and $\sigma_{2i}$ is
the rate out of state 2 to 1. $\lambda_{1i}$ and $\lambda_{2i}$ are the traffic rates when the MMPP$^{(i)}$ is in
state 1 and 2, respectively. The fitting algorithm described in Ref. [15] derives the
parameters $\sigma_{1i}$, $\sigma_{2i}$, $\lambda_{1i}$ and $\lambda_{2i}$ of each MMPP$^{(i)}$ ($1 \le i \le L$) for matching the mean
and autocorrelation function over different time scales. The superposition of the
MMPP$^{(i)}$s ($1 \le i \le L$) gives rise to a new MMPP with $2^L$ states, and its parameter
matrices, $A_s$ and $B_s$, can be computed as (the symbol "$\oplus$" denotes the Kronecker
sum [18])

$$A_s = A_1 \oplus A_2 \oplus \cdots \oplus A_L \quad\text{and}\quad B_s = B_1 \oplus B_2 \oplus \cdots \oplus B_L \qquad (3)$$
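
The superposition step can be illustrated with a minimal sketch in plain Python. It assumes the usual definition of the Kronecker sum, $A \oplus B = A \otimes I + I \otimes B$; the function names and the numerical parameters are illustrative assumptions and are not taken from the paper or from the fitting algorithm of Ref. [15].

```python
from functools import reduce

def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def kron_sum(a, b):
    """Kronecker sum: a (+) b = a (x) I_b + I_a (x) b."""
    left = kron(a, identity(len(b)))
    right = kron(identity(len(a)), b)
    return [[l + r for l, r in zip(row_l, row_r)]
            for row_l, row_r in zip(left, right)]

def two_state_mmpp(sigma1, sigma2, lam1, lam2):
    """Generator and rate matrix of a two-state MMPP, laid out as in Eq. (2)."""
    generator = [[-sigma1, sigma1], [sigma2, -sigma2]]
    rates = [[lam1, 0.0], [0.0, lam2]]
    return generator, rates

def superpose(mmpps):
    """Superpose independent MMPPs by Kronecker sums, as in Eq. (3)."""
    generators = [a for a, _ in mmpps]
    rate_matrices = [b for _, b in mmpps]
    return reduce(kron_sum, generators), reduce(kron_sum, rate_matrices)

# Hypothetical parameters, chosen only to exercise the code.
components = [two_state_mmpp(0.1, 0.2, 1.0, 0.0),
              two_state_mmpp(0.01, 0.02, 0.5, 0.1)]
A_s, B_s = superpose(components)
```

With two components the result has $2^2 = 4$ states; every row of the superposed generator still sums to zero, and the superposed rate matrix stays diagonal with the component rates added state by state.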



A message enters the network through one of the $V$ injection virtual channels
with even probability $1/V$. Given that the process resulting from the splitting of an
MMPP has the same infinitesimal generator as the original MMPP [17], the
infinitesimal generator $A_v$ and rate matrix $B_v$ of the resulting MMPP that models the
traffic on an injection virtual channel in the source node are given by

$$A_v = A_s \quad\text{and}\quad B_v = B_s / V \qquad (4)$$



To determine the mean waiting time that a message experiences before entering the
network, the injection virtual channel in the source node is modelled as an
MMPP/G/1 queueing system. The mean waiting time, $W_s$, can be expressed as [18]

$$W_s = \frac{1}{2\mu}\left[\frac{2\mu + \lambda_v \bar{T}^{(2)} - 2\bar{T}\left((1-\mu)\mathbf{g} + \bar{T}\boldsymbol{\pi}B_v\right)\left(A_v + \mathbf{e}\boldsymbol{\pi}\right)^{-1}\boldsymbol{\lambda}}{1-\mu} - \lambda_v \bar{T}^{(2)}\right] \qquad (5)$$

In the above equation, $\bar{T}$ and $\bar{T}^{(2)}$ denote the first two moments of the message
service time and can be computed by differentiating $T^*(s)$ and setting $s = 0$ [14]. $\mathbf{e}$
is a unit column vector of length $2^L$. The traffic intensity is $\mu = \bar{T}\lambda_v$, where $\lambda_v$ is the
mean traffic rate and is given by $\lambda_v = \boldsymbol{\pi}\boldsymbol{\lambda}$. $\boldsymbol{\lambda}$ is a column vector containing the
elements on the main diagonal of $B_v$, and $\boldsymbol{\pi}$ is the steady-state vector of the MMPP.

When a message header is blocked at an intermediate node, it experiences a
connection failure and will make a new attempt to establish a path from the source node. Let
$P_{b_i}$ denote the probability that the header suffers blocking after making $i$ hops. The
probability of a successful connection, $P_s$, and of a connection failure, $P_f$, during a
single connection attempt can be written as

$$P_s = \prod_{i=0}^{D-1}\left(1 - P_{b_i}\right) \quad\text{and}\quad P_f = 1 - P_s = 1 - \prod_{i=0}^{D-1}\left(1 - P_{b_i}\right) \qquad (6)$$

A message may need a number of, say, $r$ ($r = 1, 2, \ldots, \infty$) connection attempts in
order to successfully establish a path. The traffic due to the $r$-th attempt of the
MMPP$^{(i)}$ ($1 \le i \le L$) can be modelled by a new two-state MMPP$^{(i,r)}$, which is the
process resulting from the splitting, with probability $P_f^{\,r-1}$, of the original MMPP$^{(i)}$.
The infinitesimal generator $A_{ir}$ and rate matrix $B_{ir}$ of the MMPP$^{(i,r)}$ are given by [17]

$$A_{ir} = A_i \quad\text{and}\quad B_{ir} = P_f^{\,r-1} B_i \qquad (7)$$

Superposing the traffic caused by all $r$ ($r = 1, 2, \ldots, \infty$) connection attempts of the
messages generated by a source node yields the effective traffic entering the network.
Therefore, the effective traffic can be modelled by the superposition of all MMPP$^{(i,r)}$s
with $1 \le i \le L$ and $r = 1, 2, \ldots, \infty$. As the superposition of MMPPs gives rise again
to an MMPP [18], the effective traffic from a given source node can be characterised
by a new multi-state MMPP. To calculate the parameter matrices of this new MMPP,
we first use a two-state MMPP$^{(1,e)}$ to match the superposition of all MMPP$^{(1,r)}$s
with $r = 1, 2, \ldots, \infty$, because these MMPPs model traffic burstiness and correlations
over the same time scale. Using the parameter matrices of the MMPP$^{(1,r)}$s
with $r = 1, 2, \ldots, \infty$ as input parameters, the method presented in Ref. [6] for
superposing infinite correlated traffic streams can be used to derive the infinitesimal
generator $A_{1e}$ and rate matrix $B_{1e}$ of the MMPP$^{(1,e)}$. Similarly, we separately match the
superposition of the MMPP$^{(i,r)}$s with $r = 1, 2, \ldots, \infty$ to a two-state MMPP$^{(i,e)}$ with the
resulting parameter matrices $A_{ie}$ and $B_{ie}$. We then calculate the Kronecker sum of
the parameter matrices of MMPP$^{(1,e)}$, MMPP$^{(2,e)}$, ..., MMPP$^{(L,e)}$ to parameterise the
multi-state MMPP that characterises the effective traffic entering the network from a
given source node. So, the infinitesimal generator $A_e$ and rate matrix $B_e$ of the
multi-state MMPP are given by

$$A_e = A_{1e} \oplus A_{2e} \oplus \cdots \oplus A_{Le} \quad\text{and}\quad B_e = B_{1e} \oplus B_{2e} \oplus \cdots \oplus B_{Le} \qquad (8)$$

A message may encounter blocking at any of the $D$ intermediate nodes along its
path. The probability, $P_{f_i}$, that a header experiences a connection failure at a node
that is $i$ hops away from the source can be expressed as

$$P_{f_i} = P_{b_i}\prod_{j=0}^{i-1}\left(1 - P_{b_j}\right) \qquad (9)$$

Taking into account the cases of a connection success and connection failures
occurring at $D$ possible nodes gives the average number of channels, $\bar{c}$, traversed by a
message during a single connection attempt

$$\bar{c} = D\,P_s + \sum_{i=0}^{D-1} i\,P_{f_i} = D\prod_{i=0}^{D-1}\left(1 - P_{b_i}\right) + \sum_{i=0}^{D-1}\left(i\,P_{b_i}\prod_{j=0}^{i-1}\left(1 - P_{b_j}\right)\right) \qquad (10)$$
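
As a hedged illustration of Eqs. (6), (9) and (10), the following sketch evaluates the connection probabilities and the average number of channels for a given list of per-hop blocking probabilities. The function name and the sample values are assumptions made for this example; in the model the blocking probabilities come from Eq. (13).

```python
from math import prod

def connection_stats(pb):
    """pb[i] is the blocking probability after i hops, for 0 <= i < D."""
    D = len(pb)
    # Eq. (6): the header must pass every one of the D hops.
    p_success = prod(1.0 - p for p in pb)
    # Eq. (9): pass hops 0..i-1, then get blocked after i hops.
    p_fail_at = [pb[i] * prod(1.0 - pb[j] for j in range(i)) for i in range(D)]
    # Eq. (10): D channels on success, i channels on a failure after i hops.
    mean_channels = D * p_success + sum(i * p_fail_at[i] for i in range(D))
    return p_success, p_fail_at, mean_channels

# Hypothetical blocking probabilities for a path of D = 3 hops.
ps, pf_at, c_bar = connection_stats([0.10, 0.05, 0.02])
```

The success probability and the per-hop failure probabilities sum to one, and the average number of channels never exceeds $D$.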

Under the uniform traffic pattern, using adaptive routing results in a balanced
traffic load on all network channels. Examining Eq. (10) reveals that the average number
of channels, $\bar{c}$, traversed by a message during a single connection attempt is always
less than $n$ in an n-dimensional hypercube. This implies that the arrival traffic at a
given network channel is a fraction of the effective traffic entering the network
from a source node. This fraction, $f$, can be estimated by

$$f = \frac{N\bar{c}}{Nn} = \frac{\bar{c}}{n} \qquad (11)$$

Given that the MMPP is closed under the superposition and splitting operations,
we use an MMPP$^{(c)}$ to model the characteristics of the traffic on a network channel.
The infinitesimal generator $A_c$ and rate matrix $B_c$ of the MMPP$^{(c)}$ are given by [17]

$$A_c = A_e \quad\text{and}\quad B_c = f B_e \qquad (12)$$

After determining the characteristics of traffic on network channels, the joint
probability $P_{(i,j)}$ that $i$ ($0 \le i \le V$) virtual channels are busy and the MMPP modelling
the traffic on network channels is at state $j$ ($1 \le j \le 2^L$) can be calculated using a
bivariate Markov chain [12]. The detailed derivation of $P_{(i,j)}$ and the calculation of the
average degree of virtual channel multiplexing, $\bar{V}$, can be found in Ref. [12]. In the
hypercube, a message is blocked after making $i$ hops if all possible virtual channels
at the remaining $(D - i)$ dimensions are busy. The probability, $P_{b_i}$, can be written as

$$P_{b_i} = \left(\sum_{j=1}^{2^L} P_{(V,j)}\right)^{D-i} \quad (0 \le i \le D-1) \qquad (13)$$



Let $E_i$ denote the expected time for the header to reach the destination from the
current node. If the header succeeds in reserving the required virtual channel and
advances to the next node, the residual expected time becomes $E_{i+1}$. This case occurs
with probability $(1 - P_{b_i})$. On the other hand, if the header encounters blocking and
backtracks to the source node, the residual expected time is $E_0$. Therefore, $E_i$
satisfies the following difference equations [6]

$$E_i = (1 - P_{b_i})(E_{i+1} + 1) + P_{b_i}(E_0 + i) \quad (0 \le i \le D-1) \quad\text{and}\quad E_D = 0 \qquad (14)$$



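Solving the above equations yields $E_0$ as

$$E_0 = -\frac{x_D}{y_D} \qquad (15)$$

where the coefficients in $E_i = x_i + y_i E_0$ are given by

$$x_i = \begin{cases} 0 & (i = 0) \\ \dfrac{x_{i-1} - (i-2)P_{b_{i-1}} - 1}{1 - P_{b_{i-1}}} & (1 \le i \le D) \end{cases} \quad\text{and}\quad y_i = \begin{cases} 1 & (i = 0) \\ \dfrac{y_{i-1} - P_{b_{i-1}}}{1 - P_{b_{i-1}}} & (1 \le i \le D) \end{cases} \qquad (16)$$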
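
The difference equations in Eq. (14) can also be solved numerically. The sketch below is one way to do so, assuming the decomposition $E_i = x_i + y_i E_0$ and pushing the recurrence forward from $i = 0$ until the boundary condition $E_D = 0$ fixes $E_0$. The function name and the sample inputs are illustrative assumptions.

```python
def setup_time_from_source(pb):
    """Solve Eq. (14) for E_0, where pb[i] is P_b_i for 0 <= i < D and E_D = 0."""
    x, y = 0.0, 1.0  # E_0 = 0 + 1 * E_0
    for i, p in enumerate(pb):
        if not 0.0 <= p < 1.0:
            raise ValueError("blocking probabilities must lie in [0, 1)")
        # Rearranged Eq. (14): E_{i+1} = (E_i - p * (E_0 + i)) / (1 - p) - 1
        x = (x - p * i) / (1.0 - p) - 1.0
        y = (y - p) / (1.0 - p)
    # Boundary condition E_D = x + y * E_0 = 0.
    return -x / y

# Without blocking, the header simply needs D hops.
no_blocking = setup_time_from_source([0.0, 0.0, 0.0])
# Hypothetical two-hop path whose second hop is blocked half of the time.
with_blocking = setup_time_from_source([0.0, 0.5])
```

For the second input the equations give $E_1 = E_0 - 1$ and $E_1 = 1 + 0.5\,E_0$, so $E_0 = 4$, which is what the function returns.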
The mean time to set up a path is given by $\bar{E} = E_0 + D$. Due to the requirement of
analytic simplicity and practicality, we approximately model the distribution of the
path set-up time by an exponential distribution. So $E^*(s)$ can be expressed as [14]

$$E^*(s) = \frac{\alpha}{\alpha + s} \qquad (17)$$

where $\alpha$ is selected to match the mean path set-up time and is given by $\alpha = 1/\bar{E}$.

The mean message latency is composed of the mean network latency and the mean
waiting time at the source node. However, to model the effect of virtual channel
multiplexing, the message latency has to be scaled by the average degree of virtual
channel multiplexing that takes place at a given physical channel. Thus, we can write [13]

$$\text{Latency} = (\bar{T} + W_s)\,\bar{V} \qquad (18)$$
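
As a worked illustration of how the service-time moments used in Eq. (5) can be obtained, assume, as stated above, that the set-up time $E$ is exponential with rate $\alpha$ and that $D$ and $M$ are constants in $T = E + D + M$. The moments then follow directly, without differentiating the transform explicitly:

$$\bar{T} = \frac{1}{\alpha} + D + M = \bar{E} + D + M, \qquad \bar{T}^{(2)} = \frac{2}{\alpha^2} + \frac{2(D+M)}{\alpha} + (D+M)^2 = \frac{1}{\alpha^2} + \bar{T}^{\,2}$$

The second expression uses the fact that adding a constant leaves the variance $1/\alpha^2$ of the exponential set-up time unchanged.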



3 Simulation Experiments 

We have developed a discrete-event simulator, operating at the flit level, in order to
validate the above analytical model. Each simulation experiment was run until the
network converged to its steady state. The cycle time in the simulator is defined as the
transmission time of a single flit to cross from one node to the next. Message
destinations are uniformly distributed across the network. Figures 1~3 depict results for the
mean message latency predicted by the above model plotted against those provided
by the simulator in the 4, 6 and 8-dimensional hypercubes, respectively. Message
length is M = 32 and 64 flits. The number of virtual channels per physical channel is
V = 3, 5 and 7. Hurst parameters are H = 0.6, 0.7 and 0.8; the parameter for computing the
autocorrelation at lag 1 is $\rho$ = 0.7, 0.8 and 0.9. We have modelled burstiness over five
time scales. The figures reveal that the simulation results closely match those
predicted by the analytical model in the steady state region. The model's tractability makes it a
practical and cost-effective evaluation tool to study the performance behaviour of
circuit-switched hypercubes in the presence of multiple time-scale bursty and
correlated traffic.

Fig. 1. Latency predicted by the model and simulation in 4-dimensional hypercubes, V = 3,
H = 0.6, $\rho$ = 0.7.

Fig. 2. Latency predicted by the model and simulation in 6-dimensional hypercubes, V = 5,
H = 0.7, $\rho$ = 0.8.

Fig. 3. Latency predicted by the model and simulation in 8-dimensional hypercubes, V = 7,
H = 0.8, $\rho$ = 0.9.



4 Conclusions 

There has been growing evidence over the past few years that traffic burstiness and 
correlation over many time scales appear in a variety of systems including local-area 
and wide-area networks, digitised multimedia systems, web servers, and parallel 
computation systems. This fractal-like traffic behaves very differently from the
conventional Poisson process and has a great impact on network
performance. In an effort towards providing cost-effective tools for hypercube
networks, this paper proposes an analytical model for circuit-switched hypercubes in the
presence of multiple time-scale bursty and correlated traffic, which is modelled by the
superposition of a number of different two-state MMPPs. The validity of the
model is demonstrated by comparing analytical results to those obtained through
simulation experiments of the actual system.
Abstract. Leader election in a network is one of the most important problems
in the area of distributed algorithm design. Consider any network of N nodes;
a leader node is defined to be any node of the network unambiguously
identified by some characteristics (unique from all other nodes). A leader election
process is defined to be a uniform algorithm (code) executed at each node of the
network; at the end of the algorithm execution, exactly one node is elected the
leader and all other nodes are in the non-leader state [GHS83, LMW86, Tel93,
Tel95a, SBTS01]. In this paper, our purpose is to propose an election algorithm
for the oriented hyper butterfly networks with $O(N \log N)$ messages.



1 Hyper Butterfly Graphs 

1.1 Hypercube 

A hypercube $H_n$, of order $n$, is defined to be a regular symmetric graph $G = (V, E)$
where $V$ is the set of $2^n$ vertices, each representing a distinct n-bit binary number, and
$E$ is the set of symmetric edges such that two nodes are connected by an edge iff the
Hamming distance between the two nodes is 1, i.e., the number of positions where the
bits differ in the binary labels of the two nodes is 1. For example, in $H_3$, the node 010
is connected to three nodes 110, 000 and 011. It is known that the number of edges in
$H_n$ is $n \times 2^{n-1}$ and the diameter of $H_n$ is $n$.
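
The adjacency rule can be sketched in a few lines of Python, assuming nodes are represented as integers whose binary expansion is the node label; the helper names are illustrative.

```python
def hypercube_neighbors(node, n):
    """Neighbors of `node` in H_n: flip each of the n label bits in turn."""
    return [node ^ (1 << i) for i in range(n)]

def hypercube_edge_count(n):
    """Count undirected edges of H_n by summing degrees and halving."""
    return sum(len(hypercube_neighbors(v, n)) for v in range(2 ** n)) // 2

neighbors_of_010 = sorted(hypercube_neighbors(0b010, 3))
```

For node 010 in $H_3$ this yields 000, 011 and 110, matching the example in the text, and the edge count agrees with $n \times 2^{n-1}$.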

1.2 Butterfly Graph 

A wrapped butterfly network, denoted by $B_n$, is defined [Lei92] as follows: a vertex is
represented as $(z_{n-1} \cdots z_0, \ell)$, where $z_{n-1} \cdots z_0$ is an n-bit binary number and $\ell$ is an
integer, $0 \le \ell \le n - 1$.

The edges of $B_n$ are defined by a set of four generators. Consider an arbitrary node
$(z_{n-1} \cdots z_0, \ell)$ in $B_n$. We define $\alpha(\ell) = \ell + 1 \pmod n$ and $\beta(\ell) = \ell - 1 \pmod n$. The
four edges of node $(z_{n-1} \cdots z_0, \ell)$ can be derived by the following four generators:

$g(z_{n-1} \cdots z_0, \ell) = (z_{n-1} \cdots z_0, \alpha(\ell))$
$g^{-1}(z_{n-1} \cdots z_0, \ell) = (z_{n-1} \cdots z_0, \beta(\ell))$
$f(z_{n-1} \cdots z_0, \ell) = (z_{n-1} \cdots z_{\ell+1}\,\bar{z}_{\ell}\,z_{\ell-1} \cdots z_0, \alpha(\ell))$
$f^{-1}(z_{n-1} \cdots z_0, \ell) = (z_{n-1} \cdots z_{\beta(\ell)+1}\,\bar{z}_{\beta(\ell)}\,z_{\beta(\ell)-1} \cdots z_0, \beta(\ell))$

Remark 1. We refer to the first part of the butterfly label, i.e. $z_{n-1} \cdots z_0$, as the
complementation index, and to the second part, i.e. $\ell$, as the permutation index.
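
The four generators can be sketched as follows, assuming the complementation index is stored as an integer `z` and that $f$ and $f^{-1}$ complement the indicated bit; the function names mirror the generator names but are otherwise illustrative.

```python
def g(z, l, n):
    """Keep the complementation index, advance the permutation index."""
    return z, (l + 1) % n

def g_inv(z, l, n):
    return z, (l - 1) % n

def f(z, l, n):
    """Complement bit l of z, then advance the permutation index."""
    return z ^ (1 << l), (l + 1) % n

def f_inv(z, l, n):
    """Step the permutation index back to beta(l) and complement that bit."""
    b = (l - 1) % n
    return z ^ (1 << b), b

def butterfly_neighbors(z, l, n):
    return [gen(z, l, n) for gen in (g, g_inv, f, f_inv)]

sample = butterfly_neighbors(0b010, 1, 3)
```

Applying $f$ and then $f^{-1}$ (or $g$ and then $g^{-1}$) returns the starting node, and for $n \ge 3$ the four neighbors are distinct.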
1.3 Hyper-Butterfly Graph

Consider two undirected graphs $G = (V_G, E_G)$ and $H = (V_H, E_H)$; the product graph
$G \times H$ has node set $V_G \times V_H$. Let $u$ and $v$ be any two nodes in $G$, and let $x$ and $y$ be
any two nodes in $H$; then, $((u, x), (v, y))$ is an edge of $G \times H$ iff either (1) $(u, v)$ is an
edge of $G$ and $x = y$, or (2) $(x, y)$ is an edge of $H$ and $u = v$.

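Definition 1. A Hyper-Butterfly graph $HB_{(m,n)}$ of order (dimension) $(m + n)$ is
defined as the product graph of a hypercube $H_m$ of dimension $m$ and a butterfly $B_n$ of
dimension $n$.

In $HB_{(m,n)}$, each node is assigned a label $(x_{m-1} \cdots x_0, z_{n-1} \cdots z_0, \ell)$, where each
$x_i$ and $z_j$ is a binary bit, $0 \le i \le m - 1$ and $0 \le j \le n - 1$, and $\ell$ is an integer, $0 \le \ell \le n - 1$.
$x_{m-1} \cdots x_0$ is the hypercube-part-label and $(z_{n-1} \cdots z_0, \ell)$ is the butterfly-part-label. The
edges of the graph are defined by the following $m + 4$ generators:

$h_i(x_{m-1} \cdots x_0, z_{n-1} \cdots z_0, \ell) = (x_{m-1} \cdots x_{i+1}\,\bar{x}_i\,x_{i-1} \cdots x_0, z_{n-1} \cdots z_0, \ell)$, $\forall i$, $0 \le i \le m - 1$
$g(x_{m-1} \cdots x_0, z_{n-1} \cdots z_0, \ell) = (x_{m-1} \cdots x_0, z_{n-1} \cdots z_0, \alpha(\ell))$, $\alpha(\ell) = \ell + 1 \pmod n$
$g^{-1}(x_{m-1} \cdots x_0, z_{n-1} \cdots z_0, \ell) = (x_{m-1} \cdots x_0, z_{n-1} \cdots z_0, \beta(\ell))$, $\beta(\ell) = \ell - 1 \pmod n$
$f(x_{m-1} \cdots x_0, z_{n-1} \cdots z_0, \ell) = (x_{m-1} \cdots x_0, z_{n-1} \cdots z_{\alpha(\ell)+1}\,\bar{z}_{\alpha(\ell)}\,z_{\alpha(\ell)-1} \cdots z_0, \alpha(\ell))$
$f^{-1}(x_{m-1} \cdots x_0, z_{n-1} \cdots z_0, \ell) = (x_{m-1} \cdots x_0, z_{n-1} \cdots z_{\ell+1}\,\bar{z}_{\ell}\,z_{\ell-1} \cdots z_0, \beta(\ell))$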

Remark 2.

- The set of $m + 4$ generators of the graph $HB_{(m,n)}$, $\Omega = \{h_i, 0 \le i < m, f, g, f^{-1}, g^{-1}\}$,
is closed under inverse; in particular, $h_i$ for all $i$ is its own inverse, $g$ is the inverse
of $g^{-1}$ and $f$ is the inverse of $f^{-1}$; thus the edges in $HB_{(m,n)}$ are bidirectional.

- For an arbitrary $n$, $n > 2$, and for any arbitrary node $v$ of the graph $HB_{(m,n)}$, $\delta(v) \ne v$
where $\delta \in \Omega = \{h_i, 0 \le i < m, f, g, f^{-1}, g^{-1}\}$; also, for any two distinct $\delta_1, \delta_2 \in \Omega$,
$\delta_1(v) \ne \delta_2(v)$.

- The hyper-butterfly graph $HB_{(m,n)}$ is a Cayley graph of degree $m + 4$.

- For any $m$ and $n$, $n \ge 3$, the graph $HB_{(m,n)}$ (1) is a symmetric (undirected) regular
graph of degree $m + 4$; (2) has $n \times 2^{m+n}$ vertices; and (3) has $(m + 4) \times n \times 2^{m+n-1}$
edges.
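
The vertex and edge counts in Remark 2 can be checked by brute force on a small instance. The sketch below builds $HB_{(m,n)}$ from the $m + 4$ generators of Definition 1, assuming both label parts are stored as integers and that $f$, $f^{-1}$ and $h_i$ complement the indicated bit; the function names are illustrative.

```python
from itertools import product

def hb_neighbors(x, z, l, m, n):
    """Apply the m + 4 generators to node (x, z, l) of HB(m, n)."""
    a, b = (l + 1) % n, (l - 1) % n
    nbrs = [(x ^ (1 << i), z, l) for i in range(m)]  # h_i: flip hypercube bit i
    nbrs += [(x, z, a), (x, z, b)]                    # g and g^-1
    nbrs.append((x, z ^ (1 << a), a))                 # f: flip bit alpha(l)
    nbrs.append((x, z ^ (1 << l), b))                 # f^-1: flip bit l
    return nbrs

def hb_counts(m, n):
    """Return (number of vertices, number of undirected edges) of HB(m, n)."""
    nodes = list(product(range(2 ** m), range(2 ** n), range(n)))
    edges = set()
    for x, z, l in nodes:
        for w in hb_neighbors(x, z, l, m, n):
            edges.add(frozenset(((x, z, l), w)))
    return len(nodes), len(edges)

vertices, edges = hb_counts(2, 3)
```

For $m = 2$ and $n = 3$ the construction gives 96 vertices and 288 edges, in agreement with $n \times 2^{m+n}$ and $(m + 4) \times n \times 2^{m+n-1}$.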

Definition 2.

- The $m$ edges generated by the generators $h_i$ are called hypercube edges and the
4 edges generated by any of the generators $g$, $f$, $g^{-1}$, $f^{-1}$ are called butterfly
edges.

- Any arbitrary node $v = (h, b) \in HB_{(m,n)}$ has $m$ hypercube neighbors $\{(h_i(h), b),
1 \le i \le m\}$ (reached from $v$ by the $m$ hypercube edges) and has 4 butterfly
neighbors $\{(h, \delta(b)), \delta \in \{g, f, g^{-1}, f^{-1}\}\}$ (reached from $v$ by the 4 butterfly edges).
Remark 3. Along any hypercube edge, only the hypercube-part-label of a node changes,
and along any butterfly edge, only the butterfly-part-label changes.



Remark 4. The labeling of a hyper-butterfly graph is not unique. There exist many
possible label assignments for the same graph using the traditional labeling scheme.
We arbitrarily choose one such traditional labeling, refer to it as the canonical labeling,
and refer to the nodes using their canonical labels.

Definition 3.

- We use $H_m^{(*,z,\ell)}$ to denote an m-dimensional hypercube subgraph of $HB_{(m,n)}$
where each node has the same butterfly-part-label $(z, \ell)$.

- We use $B_n^{(h,*,*)}$ to denote an n-dimensional butterfly subgraph of $HB_{(m,n)}$ where
each node has the same hypercube-part-label $h$.

- We use $R_n^{(h,z,*)}$ to denote a ring of $n$ nodes where each node has the same
hypercube-part-label $h$ and the same complementation index $z = z_{n-1} \cdots z_0$.

- We use $HR_{(m,n)}^{(*,z,*)}$ to denote the set of nodes that have the same complementation
index $z$. This set of nodes is actually the product of an $H_m^{(*,z,\ell)}$ and an $R_n^{(h,z,*)}$ with the
same $z$ value.



2 Leader Election Algorithm in Hyper-Butterfly Graph 

Consider a hyper-butterfly graph $HB_{(m,n)}$; a leader node is defined to be any node
of the graph unambiguously identified by some characteristics (unique from all other
nodes). A leader election process is defined to be a uniform algorithm executed at each
node of the network; at the end of the algorithm execution, exactly one node is elected
the leader and all other nodes are in the non-leader state.

Remark 5. If each node knows its canonical label, this election process is trivial.
Consider the node having the smallest label, i.e. $(0 \cdots 0, 0 \cdots 0, 0)$ in $HB_{(m,n)}$; we can say
that the node with this label automatically becomes the leader and all other nodes
are non-leaders.

In this paper, as in [Tel95b], we assume that the nodes in the graph do not know
their canonical labels. We will still refer to the nodes by some canonical labels for
convenience, but these labels or names have no topological significance [Tel95b]. We
also assume in this paper that the network is oriented in the sense that each node can
differentiate the links incident to it by different generators (in contrast, a node in an
un-oriented star graph distinguishes its adjacent links by different but uninterpreted
names). We define the direction of a link as the index of the generator that generates
the link. So, the link that is associated with generator $h_i$ has direction $i$, $0 \le i \le m - 1$,
and the links that are associated with generators $g$, $g^{-1}$, $f$, $f^{-1}$ have directions $g$, $g^{-1}$, $f$, $f^{-1}$,
respectively.

The whole election algorithm in the hyper-butterfly graph consists of three major steps.
At each step, the graph is divided into different regions, and a leader for each
region is elected. In the first step, the hyper-butterfly graph $HB_{(m,n)}$ is divided into
$n \times 2^n$ hypercubes. Within each hypercube, the nodes run a previously proposed election
algorithm for hypercubes. After this step, each hypercube will have a leader node. In
the second step, the nodes with the same complementation index are considered as
one region. For a certain complementation index $z$, the region is actually a ring of
hypercubes, i.e. $HR_{(m,n)}^{(*,z,*)}$. The hypercube leaders elected in the first step compete
with each other and elect one leader in $HR_{(m,n)}^{(*,z,*)}$ for each different $z$ value. The third
step is the final step, where the leaders elected in the second step compete with each
other and elect one final leader for $HB_{(m,n)}$. In the following sections, we discuss the
details of the election algorithm at each step.

Remark 6. It should be noted that in an oriented hyper-butterfly graph, each node can
identify the region with the knowledge of the direction of edges, i.e. a node can send a
message to some other node in the same region by using a sequence of certain
directions. The node does not have to know its canonical label in order to identify the region.

2.1 Election Algorithm in $H_m^{(*,z,\ell)}$

In $HB_{(m,n)}$ there are $n \times 2^n$ different butterfly labels, which we denote as $(z, \ell)$,
$0 \le z < 2^n$, $0 \le \ell < n$; and the nodes with the same butterfly part label form a
hypercube of dimension $m$, which we denote as $H_m^{(*,z,\ell)}$.

In this step, each node first runs a leader election algorithm for hypercubes []. The
algorithm uses only hypercube edges, i.e. edges with direction 0 to $m - 1$. At the end
of this procedure, there will be one leader elected in $H_m^{(*,z,\ell)}$ for each different $(z, \ell)$
value pair. The details of the algorithm are as follows:

After the leader is set in $H_m^{(*,z,\ell)}$, the leader broadcasts its id to all nodes in
$H_m^{(*,z,\ell)}$. Each node that receives this broadcast message saves the leader's id into a
variable HL (abbreviation for hypercube leader).

Lemma 1. This step requires less than $7.24 \times n \times 2^{m+n}$ messages [Tel95b].

2.2 Election in $HR_{(m,n)}^{(*,z,*)}$

After the first step, there will be a leader in each hypercube $H_m^{(*,z,\ell)}$ ($0 \le z < 2^n$,
$0 \le \ell < n$). We use $(h(z, \ell), z, \ell)$ to denote the label of the leader in $H_m^{(*,z,\ell)}$, where $h(z, \ell)$
specifies the hypercube part label of the leader.

Remark 7. $h(z, \ell)$ does not denote a particular function to derive the hypercube part
label from $z$ and $\ell$; it only indicates that the hypercube part label of the leader
varies for different hypercubes. Since the leader in $H_m^{(*,z,\ell)}$ is determined for any $(z, \ell)$
value pair, the hypercube part label of the leader is also determined and depends solely
on the values of $z$ and $\ell$.

In the first step, the leader of each hypercube also broadcasts its id within the
hypercube when it becomes the leader. After the broadcast procedure, every node in $H_m^{(*,z,\ell)}$
will be informed of the id of the leader, i.e. $(h(z, \ell), z, \ell)$, and has variable HL set to it.
Fig. 1. Example of leader election in $HR_{(2,3)}^{(*,z,*)}$ (a subgraph of $HB_{(2,3)}$). The figure
distinguishes butterfly edges, hypercube edges, and the direction in which the RingTest message travels.



The objective in the second step is to elect leaders in larger regions. In this step,
the nodes in the hyper-butterfly graph $HB_{(m,n)}$ are considered to be grouped into $2^n$ new
regions $HR_{(m,n)}^{(*,z,*)}$, $0 \le z < 2^n$, each consisting of $n$ hypercubes with the same
complementation index, $H_m^{(*,z,\ell)}$, $0 \le \ell < n$. Each hypercube has one leader elected
in the first step, and there are $n \times 2^n$ such hypercube leaders in total, with $n$ in each
new region $HR_{(m,n)}^{(*,z,*)}$. In the second step, each hypercube leader $(h(z, \ell), z, \ell)$ invokes
procedure HRElect to compete with the other $n - 1$ hypercube leaders in the same
region $HR_{(m,n)}^{(*,z,*)}$. Only one of them becomes the new leader. After every hypercube
leader finishes procedure HRElect, there will be only $2^n$ leaders left, with one in each
$HR_{(m,n)}^{(*,z,*)}$, $0 \le z < 2^n$. The pseudocode of procedure HRElect is listed below.

Procedure HRElect($h(z, \ell), z, \ell$)

Initial Conditions:

1. Node $(h(z, \ell), z, \ell)$ is the leader of $H_m^{(*,z,\ell)}$.

2. All nodes in $HB_{(m,n)}$ have variable HL set to the label of their hypercube
leader.

Invocation of the Procedure:

Node $(h(z, \ell), z, \ell)$ sends message RingTest($(h(z, \ell), z, \ell)$, 0, True) along
direction $g$.

Upon receiving message RingTest(id, i, b) from direction $g^{-1}$:

// id is the label of the node which invoked procedure HRElect.
// i is an integer from 0 to n - 1 and b is a boolean
// with the value of either True or False.
if (i < n - 1)
{
    if (b == False || HL > id)
        send message RingTest(id, i + 1, False) through direction g.
    else
        send message RingTest(id, i + 1, True) through direction g.
}
else
{
    // This means the message has returned to the leader node (h(z, l), z, l).
    if (b == True)
        // This node passed all tests in the ring and becomes the new leader.
        Current node (h(z, l), z, l) becomes the leader of $HR_{(m,n)}^{(*,z,*)}$.
    else
        // Failed the test; becomes non-leader.
        Current node becomes non-leader in $HR_{(m,n)}^{(*,z,*)}$.
}



As we can see, the procedure can be invoked from any hypercube leader
$(h(z, \ell), z, \ell)$. It consists of sending message RingTest(id, i, b) carrying three
parameters. The first parameter is the id of the hypercube leader that invokes the
procedure. The second parameter i is an integer that counts the number of nodes message
RingTest(id, i, b) has passed, excluding the origin node. b is a boolean that indicates whether id is
large enough to be the leader of the part of $HR_{(m,n)}^{(*,z,*)}$ that the message has passed so
far.

Since every node increments i and relays the message through direction $g$, message
RingTest(id, i, b) goes through a ring of $n$ nodes and then gets back to the origin node
$(h(z, \ell), z, \ell)$. At each intermediate node, the variable HL is compared to the
id that comes from the message. If HL is larger than id, then b is set to False to
indicate that the node that invoked the procedure is not large enough to be the leader of
$HR_{(m,n)}^{(*,z,*)}$. When message RingTest(id, i, b) gets back to the origin node that invoked
the procedure, the node checks the value of b and becomes the leader of $HR_{(m,n)}^{(*,z,*)}$ or a
non-leader accordingly.
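
The behaviour of the ring test can be illustrated with a small sequential simulation. This is only a sketch under simplifying assumptions: the ring is a Python list whose entry at position l is the HL value stored at that position (the id of the leader of the hypercube with permutation index l), messages are processed one candidate at a time rather than concurrently, and ids are compared as tuples. The function name and the sample ids are hypothetical.

```python
def hr_elect(hl_ids):
    """Simulate HRElect on one ring; hl_ids[l] is the HL value at position l."""
    n = len(hl_ids)
    winners, messages = [], 0
    for origin in range(n):
        candidate = hl_ids[origin]
        pos, i, b = origin, 0, True
        messages += 1                 # RingTest(id, 0, True) sent along g
        while True:
            pos = (pos + 1) % n       # message arrives at the next ring node
            if i < n - 1:
                if (not b) or hl_ids[pos] > candidate:
                    b = False         # a larger hypercube leader exists
                i += 1
                messages += 1         # relay RingTest(id, i + 1, b) along g
            else:
                break                 # back at the origin node
        if b:
            winners.append(candidate)
    return winners, messages

# Hypothetical ids (h, z, l) of the three hypercube leaders for z = 0, n = 3.
leaders, sent = hr_elect([(2, 0, 0), (1, 0, 1), (3, 0, 2)])
```

Each candidate causes exactly $n$ messages, so a ring with $n$ hypercube leaders generates $n^2$ RingTest messages, and only the candidate with the largest id keeps b equal to True.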

Lemma 2. For any arbitrary value of $z$, $0 \le z < 2^n$, after all $n$ hypercube leaders in
$HR_{(m,n)}^{(*,z,*)}$ execute procedure HRElect and get back the message RingTest, only one
of them will become the leader of $HR_{(m,n)}^{(*,z,*)}$ and all others will become non-leaders.

Proof. For any arbitrary value $z$, there are $n$ hypercube leaders in $HR_{(m,n)}^{(*,z,*)}$. They
are $(h(z, \ell), z, \ell)$, $0 \le \ell < n$. Each of them invokes procedure HRElect and
determines whether it becomes a leader or a non-leader depending on the value of b returned by
message RingTest.

Consider an arbitrary leader $(h(z, \ell), z, \ell)$ among them. This node starts procedure
HRElect by sending message RingTest($(h(z, \ell), z, \ell)$, 0, True) through direction $g$.
Because every node that gets the message also relays it through direction $g$, the message
traverses every node in $R_n^{(h(z,\ell),z,*)}$, which includes $n$ nodes: $(h(z, \ell), z, j)$, $0 \le j < n$.
Node $(h(z, \ell), z, j)$ has variable HL set to the id of the leader of hypercube
$H_m^{(*,z,j)}$, i.e. $(h(z, j), z, j)$. The id $(h(z, \ell), z, \ell)$ is therefore compared with the ids of the leaders of the
other hypercubes $H_m^{(*,z,j)}$, i.e. $(h(z, j), z, j)$, $0 \le j < n$. If some node becomes the
leader after executing procedure HRElect, it is assured that its id is larger than those of all other
leaders in $H_m^{(*,z,j)}$, $0 \le j < n$. Therefore, of the $n$ hypercube leaders $(h(z, \ell), z, \ell)$,
$0 \le \ell < n$, only one can claim to be the leader of $HR_{(m,n)}^{(*,z,*)}$ after executing
procedure HRElect; all other $n - 1$ nodes will become non-leaders.



Remark 8.

- After all hypercube leaders (elected in the first step) complete procedure
HRElect, $2^n$ nodes remain as leaders, one from each $HR_{(m,n)}^{(*,z,*)}$,
$0 \le z < 2^n$.

- After a node becomes the leader in $HR_{(m,n)}^{(*,z,*)}$, it broadcasts its id to all nodes in
$HR_{(m,n)}^{(*,z,*)}$. Each node that receives the broadcast message saves the leader's id
into variable HRL.



Lemma 3. $n \times 2^{m+n} + n^2 \times 2^n$ messages are generated in the
second step, including procedure HRElect and the broadcast process afterwards.

Proof. There are $n \times 2^n$ hypercube leaders elected in the first step. Each leader
executing procedure HRElect generates $n$ messages. $2^n$ leaders are
elected afterwards. Each leader in $HR_{(m,n)}^{(*,z,*)}$, $0 \le z < 2^n$, broadcasts its id, which
takes $n \times 2^m$ messages. So the total number of messages needed in this step is
$n \times 2^{m+n} + n^2 \times 2^n$.
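
The arithmetic behind the message counts of the first two steps can be sketched as follows. The sketch assumes the bounds as read from Lemma 1 and Lemma 3, namely $7.24 \times n \times 2^{m+n}$ for the first step and $n \times 2^{m+n} + n^2 \times 2^n$ for the second; the third step is not included, and the function name is illustrative.

```python
def messages_first_two_steps(m, n):
    """Message bounds for steps 1 and 2 (step 3 is deliberately left out)."""
    hypercube_elections = 7.24 * n * 2 ** (m + n)  # Lemma 1 bound
    ring_tests = (n * 2 ** n) * n                  # n messages per hypercube leader
    ring_broadcasts = (2 ** n) * (n * 2 ** m)      # one broadcast per ring-region leader
    return hypercube_elections, ring_tests + ring_broadcasts

step1, step2 = messages_first_two_steps(2, 3)
```

Adding the two bounds gives $8.24 \times n \times 2^{m+n} + n^2 \times 2^n$, which accounts for the coefficient 8.24 and the $n^2$ term that appear in the total message count.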

2.3 Leader Election in $HB_{(m,n)}$

After the second step, there will be one leader left in each $HR_{(m,n)}^{(*,z,*)}$, $0 \le z < 2^n$. We
use $(h(z), z, \ell(z))$ to denote the label of the leader in $HR_{(m,n)}^{(*,z,*)}$.

Remark 9. Similar to the second step, $h(z)$ or $\ell(z)$ does not specify a function to
derive the hypercube part label or permutation index of the leader node. They only
indicate that those labels depend on $z$.

In the third step, which is the final step, the objective is to elect one leader for the entire
hyper-butterfly graph $HB_{(m,n)}$. Since we already have $2^n$ leaders, one in each $HR_{(m,n)}^{(*,z,*)}$,
we use a similar approach as in the second step to elect one node from those leaders
to become the final leader.

In this step, each of the $2^n$ leaders from the second step sends a TreeTest
message through a tree structure in the butterfly graph. We ensure that the message reaches
the nodes with different complementation indices, so that the id of each leader is
tested to see whether it is larger than the ids of all other leaders. As in procedure HRElect,
only the node that passes all tests becomes the leader, which is the leader of the entire
hyper-butterfly graph $HB_{(m,n)}$. The pseudocode listing of the details of the procedure
HBElect, to be invoked by every leader of $HR_{(m,n)}^{(*,z,*)}$, is omitted for lack of space.

There are two types of messages used in the procedure. The first type of message is
TreeTest, which travels down a binary tree because each node distributes the message
through directions $g$ and $f$. The parameters id and b have the same meaning as in the
second step, while i indicates the current level of the tree. The other type of message
is TreeReply, which goes through the reverse path of TreeTest. Only the leaf nodes
compare the value of HRL with the id that comes from the message. The intermediate
nodes only transmit message TreeTest to both children and collect TreeReply
from them. We state the following lemmas (the proofs are omitted for lack of space).

Lemma 4. Consider the execution of procedure HBElect from any node
$(h(z), z, \ell(z))$; there are $2^n$ nodes that receive both message TreeTest(id, i, b) and
TreeReply(id, i, b). These $2^n$ nodes have the same permutation index and hypercube
label, but different complementation indices.

Lemma 5. After every leader from the second step finishes executing procedure
HBElect, only one node will become the leader of $HB_{(m,n)}$. All other leaders will
become non-leaders.

Lemma 6. There are 2^”+^ number of messages generated in the third step. 

Theorem 1. The total number of message needed for a leader election algorithm in 
is 8.24 X n X 2™+" + (n^ + 2"+^) x 2". 
Abstract. The TCP (Transmission Control Protocol) is a critical
component in an ad hoc wireless network because of its pervasive usage
in various important applications. However, TCP's congestion control
mechanism is notoriously ineffective in dealing with time-varying
channel errors even for a single wireless link. Indeed, the adverse effects of
the inefficient usage of wireless bandwidth due to the large TCP timers
are more profound in a multihop communication session. In this paper,
we design and evaluate local recovery (LR) approaches for maintaining
smooth operations of a multihop TCP session in an ad hoc network.
Based on our NS-2 simulation results, we find that using the proposed
LR approaches is better than using various well-known ad hoc routing
algorithms which construct completely new routes.

Keywords: wireless TCP, multihop communications, ad hoc networks, 
routing, local recovery. 



1 Introduction 

The TCP (Transmission Control Protocol) is the most widely used transport 
protocol and, more importantly, will continue to be a critical component when 
the Internet becomes completely pervasive in a wireless manner [6].
Unfortunately, due to the fact that TCP was not designed for a wireless environment,
in which link transmission errors are the norm rather than the exception, its
performance can be unacceptable under a time-varying communication channel
[5]. Specifically, congestion is assumed to be the primary reason for packet losses
in TCP. While this is true in wired networks, the throttling actions in a wireless 
environment can be detrimental. Indeed, unnecessary reduction in network load 
over a long period of time (TCP’s timers are on the order of tens of seconds) 
leads to very inefficient use of the precious channel bandwidth and high delays. 

Recently, many adaptive TCP approaches for various wireless environments 
have been suggested. The major objective of these schemes is to make TCP 
respond more intelligently to the lossy wireless links. According to [1,5], there 
are three major classes of wireless TCP approaches: end-to-end, link layer, and
split-connection approaches. Unfortunately, all these previous approaches are 
only suitable for use in a single wireless link. For ad hoc networks where devices 
communicate in a multihop manner, these protocols are inapplicable because we 
cannot afford to have each pair of intermediate devices on a multihop route to 
execute these wireless TCP protocols [3,9]. Indeed, if a multihop ad hoc route is 
broken (e.g., due to deep fading in one of its links), the performance of a TCP 
session over such a route can be severely affected. The most obvious result is that 
the TCP sender will eventually discover such breakage after several unsuccessful 
retransmissions (i.e., after a long delay due to the large TCP timers) and then 
initiate a new session after setting up a new route. This can lead to unacceptably 
long delay at the receiver side. 

In this paper, we study the performance of two local recovery approaches, 
which work by swiftly repairing the broken link using a new partial route. In 
Section 2, we describe our proposed local recovery approaches. Section 3 contains 
our simulation results generated by the NS-2 [10] platform. 

2 The Proposed Approach 

When the original route is down, we do not simply inform the source that the 
route cannot be used. Instead, we suppress the notification which is transmitted 
to the source by TCP, and then find a new partial route between the separated 
nodes to replace the broken part of the old route. Our approach, remedial in 
nature, is a local recovery (LR) technique [4]. The essence of LR is to shield the 
route error from the source in the hope that we can avoid incurring the excessive 
delay induced by TCP. Indeed, since the problem is found locally, the remedial 
work should be done locally. 

For example, suppose that due to channel fading and nodes’ mobility, the 
link between node N and node N+1 is broken. First, we suppress the upstream 
notification generated by TCP. Next, we check whether the route table of node N 
has another route to node N+1. If there is such a route (i.e., the 
next node of the route is not N+1 itself), then the broken route is immediately 
repaired by using this route. If no such route exists, local recovery packets are 
sent to repair the route. 

A local recovery timer is set to make sure that the local recovery process does not 
consume more time than re-establishing a new route from the source. Thus, if the 
local recovery timer expires, we give up local recovery and fall back to the 
full-blown ad hoc routing protocol. 
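
The decision procedure described above can be sketched as follows. This is only an illustrative model, not the authors' NS-2 implementation: the `Node` class, the `recover` function, the action strings, and the idea of passing the discovery time as a parameter are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical model of node N on the broken route."""
    name: str
    # destination -> next hop; a stand-in for the node's route table
    route_table: Dict[str, str] = field(default_factory=dict)


def recover(node: Node, next_node: str, lr_timeout: float,
            discovery_time: Optional[float]) -> str:
    """Return the action node N takes when its link to next_node breaks.

    discovery_time is the (simulated) time the local recovery packet
    exchange would need; None means that no reply ever arrives.
    """
    alt_next_hop = node.route_table.get(next_node)
    # Another route exists if its next hop is not the unreachable neighbour.
    if alt_next_hop is not None and alt_next_hop != next_node:
        return "repair-from-route-table"
    # Otherwise send local recovery packets, bounded by the LR timer.
    if discovery_time is not None and discovery_time <= lr_timeout:
        return "repair-by-local-recovery"
    # Timer expired: give up and use the full ad hoc routing protocol.
    return "fallback-to-routing-protocol"
```

In this sketch the upstream notification is suppressed implicitly: the source only becomes involved when the function returns the fallback action.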

In the remedial process, a node N generates the local recovery route request 
(LRRREQ) packet, which includes the following information: type of the packet, 
local recovery source address, local recovery destination address, original destina- 
tion address, local recovery broadcast identifier (ID), and hop count. Whenever 
node N generates a LRRREQ, the local recovery broadcast ID is increased by 
one. Thus, the local recovery source and destination addresses, and the local 
recovery broadcast ID uniquely identify a LRRREQ. Node N broadcasts the 
LRRREQ to all nodes within the transmission range. These neighboring nodes 
then relay the LRRREQ to other neighboring nodes in the same fashion. An in- 
termediate node, upon receiving the LRRREQ, first checks whether it has seen 
this packet before by searching its LRRREQ cache. If the LRRREQ is in the 
cache, the newly received copy is discarded; otherwise, the LRRREQ is stored 
in the cache and is forwarded to the neighbors after the following major modifi- 
cations are done: incrementing the hop count, updating the previous hop node, 
and updating the time-to-live (TTL) field. 
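
A minimal sketch of the duplicate suppression and forwarding step is given below. The field names follow the list in the text (the packet type field is omitted for brevity); the class names, the initial TTL of 8, and the rule that a packet with exhausted TTL is dropped are assumptions, since the text does not specify them.

```python
from dataclasses import dataclass, replace
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class LRRREQ:
    lr_source: str
    lr_destination: str
    original_destination: str
    broadcast_id: int
    hop_count: int = 0
    previous_hop: Optional[str] = None
    ttl: int = 8  # assumed initial value; not given in the text

    def key(self) -> Tuple[str, str, int]:
        # Source, destination and broadcast ID uniquely identify a request.
        return (self.lr_source, self.lr_destination, self.broadcast_id)


class IntermediateNode:
    def __init__(self, name: str) -> None:
        self.name = name
        self.cache: Set[Tuple[str, str, int]] = set()

    def on_lrrreq(self, pkt: LRRREQ) -> Optional[LRRREQ]:
        """Return the packet to rebroadcast, or None if it is discarded."""
        if pkt.key() in self.cache:
            return None  # already seen: discard the new copy
        self.cache.add(pkt.key())
        if pkt.ttl <= 1:
            return None  # assumption: stop forwarding when the TTL runs out
        return replace(pkt, hop_count=pkt.hop_count + 1,
                       previous_hop=self.name, ttl=pkt.ttl - 1)
```

Keeping the packet immutable and returning a modified copy makes it easy to see exactly which fields an intermediate node changes.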

When node N+1, or some other intermediate node that has a fresh route 
to node N+1, receives the LRRREQ, it generates a local recovery route 
reply (LRRREP) packet, which includes the following information: type of the 
packet, local recovery source address, local recovery destination address, original 
destination address, hop count, and TTL. The LRRREP is then unicast along the 
reverse path until it reaches the local recovery 
source. During this process, each intermediate node on the reverse path updates 
its routing table entry to the local recovery destination and original destination. 

Although the new partial route is found from node N to node N+1, updating 
is needed for the original route. As described above, there are two cases where 
updating of the original route must be done. The first case is the event that the 
local recovery destination receives the LRRREQ. The second case is the event 
that an intermediate node gets the LRRREQ and it has a fresh route to the 
local recovery destination in its routing table. 

Depending on the direction, forward and backward updating are 
carried out. The forward updating process is triggered by receiving the update 
packet, which contains the following information: type of packet, update 
destination address, original destination address, hop count, and TTL. The backward 
updating process is triggered by receiving the LRRREP packet. In either case, 
the original route should be re-established. In the first case, only backward 
updating is done, while in the second case, both forward and backward updating 
are needed. 

In the former case, node N+1 receives the LRRREQ, and thus backward 
updating is done through the route of nodes N+1, 3, 2, 1, N. In the latter 
case, forward updating is done through the route of nodes 2, 3, N+1, while 
backward updating is done through the route of nodes 2, 1, N. The detailed 
updating process is as follows: when node 2 receives the LRRREQ and it has a 
route entry to node N+1, node 2 sends the update packet to node 3 according to 
the route entry to node N+1. Upon receiving the update packet, node 3 should 
update the route entry to the original destination node D and then check if it 
is the local recovery destination. The same forward updating process continues 
until the update packet is received by the local recovery destination. On the other 
hand, LRRREP is sent to node 1 following the reverse route. Upon receiving the 
LRRREP, node 1 should update the route entry to the original destination node 
D and then check if it is the local recovery source. The same backward updating 
process continues until the LRRREP is received by the local recovery source. 

This variant of our approach is similar to the mechanism we described above. 
The only difference is that the goal of route reconstruction is to find a new partial 
route from node N directly to the destination. 

3 Performance Results 

In our study, we use packet level simulations to evaluate the performance of TCP 
in ad hoc networks. The simulations are implemented in Network Simulator (NS- 
2) [10] from Lawrence Berkeley National Laboratory (LBNL) with extensions for 
wireless links from the Monarch project at Carnegie Mellon University [2]. The 
simulation parameters are as follows: 

— number of nodes: 50; 

— testing field: 1500 m x 300 m; 

— mobile speed: uniformly distributed between 0 and MAXSPEED (we choose 
MAXSPEED to be 4, 10, 20, 40, 60m/s, respectively); 

— mobility model: modified random way point model [12]; 

— traffic load: TCP Reno traffic source; 

— radio transmission range: 250m; 

— MAC layer: IEEE 802.11b. 

Each simulation is run for 200 seconds and repeated ten times. We compared 
four protocols in our simulations: DSR (Dynamic Source Routing) [7], AODV 
(Ad Hoc On-Demand Distance Vector) [11], LR1, and LR2. LR1 is the local 
recovery protocol that finds the new route between node N and the 
destination. LR2 is the local recovery protocol that finds the new route between 
node N and node N+1. 

To evaluate TCP performance in different routing protocols, we compare 
them using four metrics: 

1. Average End-to-End Delay: the average elapsed time between sending by 
the source and receiving by the destination, including the processing time 
and queuing time. 

2. Average Throughput: the average effective bit-rate of the received TCP pack- 
ets at the destination. 

3. Delivery Rate: the percentage of packets reaching the destination (note that 
some packets are lost during the route breakage and the route reconstruction 
time). 

4. Control Overhead: the data rate required by the transportation of the routing 
packets. 
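
To make the four metrics concrete, the sketch below computes them from a list of per-packet records. The record layout, the function name `summarize`, and the units are assumptions chosen for illustration; the actual values in the paper come from NS-2 traces.

```python
from typing import Dict, List, Optional, Tuple

# One record per TCP packet:
# (send time in s, receive time in s or None if lost, size in bytes)
Packet = Tuple[float, Optional[float], int]


def summarize(packets: List[Packet], routing_bytes: int,
              duration_s: float) -> Dict[str, float]:
    """Compute the four metrics for one simulation run."""
    delivered = [(s, r, n) for (s, r, n) in packets if r is not None]
    # Delay is averaged over delivered packets only ("effective" delay).
    avg_delay = (sum(r - s for s, r, _ in delivered) / len(delivered)
                 if delivered else float("nan"))
    return {
        "avg_delay_s": avg_delay,
        "throughput_bps": 8 * sum(n for _, _, n in delivered) / duration_s,
        "delivery_rate_pct": (100.0 * len(delivered) / len(packets)
                              if packets else float("nan")),
        "control_overhead_bps": 8 * routing_bytes / duration_s,
    }
```

Because the delay is averaged over delivered packets only, a protocol that delivers few packets can still show a low delay, which matters when comparing protocols with long idle periods.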

Our first set of simulation results is summarized in Figure 1. We compare 
the performance of the LR approaches against that of several well-known ad 
hoc routing protocols: AODV (Ad-Hoc On-Demand Distance Vector) [11], DSR 
(Dynamic Source Routing) [7], DSDV (Destination Sequenced Distance Vector) 
[11], and RICA (Receiver-Initiated Channel Adaptive) protocols [8]. 

First, we find that TCP is idle most of the time when used with the DSDV 
and DSR routing protocols. On the other hand, the other routing protocols 
cooperate with TCP quite well. It should be noted that DSDV is a table-driven 
routing algorithm. When the source has no route to the destination, it takes a long 
time to find a new route. This frequently leads to TCP timeout. Furthermore, 
the nodes are moving during almost the whole simulation time. Consequently, 
routes in the table can be stale very quickly and cannot be used. As for DSR, it 
is an on-demand algorithm, even though it has a route cache containing routes. 
When using DSR, a mobile device first checks if it has a route to the destination 
in the cache. Again, similar to the case of DSDV, the routes in the cache become 
stale very quickly and thus, new routes have to be found. However, as DSR is 
an on-demand algorithm, it can generally respond much faster than DSDV to 
find new routes. 




Fig. 1. Protocol performance: (a) average delay; (b) average throughput; (c) average rate of successful delivery of packets; (d) average control overhead. 



Figure 1(a) shows the average end-to-end delay of each protocol. It should 
be noted that the delay is the effective delay, i.e., the delay of the packets that actually 
arrive at the destinations. We can see that DSDV has the lowest delay among all 
protocols. The reason is that each device maintains a table of routes. DSR 
has a higher delay than DSDV because DSR caches only recently used routes. 
Among the other four routing protocols, under which TCP is almost never 
idle, RICA has the lowest delay and AODV has the highest delay. As RICA 
always chooses the best route, and the LR approaches can automatically recover 
the broken route locally, they generate smaller delays compared with AODV. 

Figure 1(b) depicts the TCP throughput over the simulation time. 
Since DSDV and DSR are idle much of the time, they have lower throughputs. The 
LR approaches have higher throughputs than AODV due to the local repairing 
mechanisms. In particular, LR2 exhibits a better performance than LR1 because 
the time consumed by node N to find node N+1 is, on average, less than that to find the 
destination. Figure 1(c) shows the delivery rate. DSDV’s delivery 
rate decreases dramatically with increasing device speed. This is because when 
the speed increases, stale routes are more common. This detrimental effect of 
mobility also applies to DSR. In the other four routing protocols, the delivery 
rate changes little with increasing speed. Figure 1(d) shows the control 
overhead required by the transportation of the routing packets. Again, because of 
idleness, DSDV and DSR have lower control overheads. RICA has the highest 
control overhead. The reason is that RICA must transmit CSI (channel state 
information) [8] packets to assist in finding the best path. The LR approaches have 
lower control overhead than AODV because when the route is broken the LR 
approaches need not inform the source to find a new route. In summary, the LR 
approaches are the best routing protocols for integrating with TCP in an ad hoc 
network. 

Another set of results concerns the TCP performance for 
different numbers of nodes (30, 50, 100, and 200 nodes). The mean mobile speed 
is fixed at 5 m/s and the maximum speed is fixed at 10 m/s. As can be seen 
in Figure 2, using the local recovery algorithms can greatly shorten the end-to-end 
delay from source to destination and lead to a higher throughput. LR1 
and LR2 have higher throughputs than AODV and DSR. Furthermore, LR1 and 
LR2 have lower delay than AODV. In general, LR2 outperforms LR1. We can 
draw the following conclusions. 

— Local Recovery suppresses the notification to the source so that the saved 
time can be utilized for local construction of new routes. 

— The source does not need to set up a new route, so the buffered packets 
in the intermediate nodes need not be sent again. 

— The unnecessary TCP timeout can be avoided in the source. 

However, DSR has the lowest delay in all compared protocols. This is because 
the delay is calculated by considering the received packets only. In fact, DSR 
has much longer idle time. For the same reason, DSR has the lowest control 
overhead. In general, LR has lower control overhead than AODV. Finally, we 
can see that LR has a higher delivery rate than AODV. Furthermore, in LR and 
AODV, the delivery rate does not change much with the number of nodes. 

Fig. 2. Simulation results for various numbers of nodes: (a) average delay; (b) average throughput; (c) average rate of successful delivery of packets; (d) average control overhead. 



We also consider the route setup delay of the protocols. The results are shown 
in Figure 3. In general, DSR has the lowest setup delay among all compared protocols. 
This is because the DSR source can use its route cache to set up a connection 
immediately. However, if a suitable cached route cannot be found, DSR takes a long 
time to find a new route. Moreover, the setup delay is then unacceptable, e.g., 
more than 6 sec. In that case, the route failure causes the TCP sender to become 
idle until a new trial begins after a long time. AODV and the LR approaches 
have nearly the same route request and reply mechanisms. They require some 
time to find the route to set up a connection. In particular, LR1 and LR2 have the 
same setup delay. When the number of nodes is small, i.e., 30 or 50, AODV and 
the LR approaches have similar setup delays. However, with a larger number of 
nodes (100 or 200), the setup delays of AODV and LR become clearly 
different. The setup delay of the LR approaches is smaller than that of AODV. 
The reason is that with an increased number of nodes, the route re-establishment 
process is more frequent, especially when the TCP synchronization packet has not 
been received by the destination. Since the LR approaches can suppress the route 
error notification to the source and locally recover the route, the unnecessary 
timeout can be avoided, or the duration of the timeout is much reduced, in the TCP 
source. 




Fig. 3. The proposed local recovery algorithm. 



Abstract. Many proposed distributed hash table (DHT) schemes for peer-to-peer 
networks are based on traditional parallel interconnection topologies. 
In this paper, we show that the Kautz graph is a very good static topology on which to 
construct DHT schemes. We demonstrate the optimal diameter and optimal 
fault tolerance properties of the Kautz graph and prove that the Kautz graph is 
(1+o(1))-congestion-free when using the long path routing algorithm. Then we 
propose FissionE, a novel DHT scheme based on the Kautz graph. FissionE has 
constant degree and O(log N) diameter and is (1+o(1))-congestion-free. FissionE 
shows that a DHT scheme with constant degree and constant congestion can 
achieve O(log N) diameter, which is better than the lower bound 
conjectured before. 



1 Introduction and Related Work 

In recent years, peer-to-peer computing has attracted significant attention from both 
industry and academic research. The core component of many proposed peer-to-peer 
systems is the distributed hash table (DHT) scheme [1], which uses a hash table-like 
interface to publish and look up data objects. DHT schemes for structured P2P systems 
have attracted much attention in academic research for their desirable 
characteristics, such as scalability, robustness, self-management, and generality. 

Many proposed DHT schemes are based on some traditional interconnection 
topology: Chord [2], Tapestry, and Pastry are based on the hypercube topology; CAN 
[3] is based on the d-torus topology; Koorde [4] and D2B [5] are based on the de 
Bruijn graph; Viceroy [6] and Ulysses [7] are based on the butterfly topology. Compared 
with the hypercube, de Bruijn, or torus topology, the Kautz graph has some better properties. 
In this paper, we demonstrate the optimal diameter and optimal fault tolerance 
properties of the Kautz graph and prove that the Kautz graph is (1+o(1))-congestion-free 
when using the long path routing algorithm. Then we propose FissionE, a novel 
DHT scheme based on the Kautz graph. FissionE is a (1+o(1))-congestion-free DHT 
scheme with constant degree and O(log N) diameter. 

Two important measures of DHT schemes are degree, the size of the routing table to 
be maintained on each peer, and diameter, the number of hops a query needs to travel 
in the worst case. In many existing DHT schemes, such as Chord, Tapestry, and 
Pastry, both the degree and the diameter tend to O(log N), while in CAN the degree 
and the diameter are O(d) and O(dN^(1/d)) respectively. An open problem posed in [1] is 
whether there exists a DHT scheme with O(d) degree and O(log N) diameter. Recent 
work [4, 5, 6, 7, 8] has shown that there are DHT algorithms that achieve O(log N) 
diameter with O(1) degree, but the algorithms cause severe congestion in P2P 
networks. Xu et al. [7] systematically studied the degree-diameter tradeoff of DHT 
schemes, defined the concept of congestion, and clarified the role that 
congestion-freeness plays in the degree-diameter tradeoff. A conjecture posed in [7] is that 
“… is the asymptotic lower bound for the diameter when the degree is no more 
than d and the network is required to be c-congestion-free for some constant c”. 
FissionE is a novel constant-degree and (1+o(1))-congestion-free DHT scheme with 
O(log N) diameter. FissionE can achieve a better bound than the conjecture above. 

FissionE is a constant-degree and (1+o(1))-congestion-free DHT scheme with 
O(log_2 N) diameter. The average degree of FissionE is 4, and the diameter of FissionE 
is less than 2*log_2 N; the average routing path length of FissionE is about log_2 N. 
By comparison, the degree of Ulysses is O(log N), which is not constant. The 
expected degree of D2B is constant, but its high probability bound is O(log N), i.e., a 
few unlucky peers would be of degree Ω(log N). The expected diameter of Viceroy is 
about 3*log_2 N; however, its O(log N) diameter is achieved not with certainty but “with 
high probability”. Among the well-known DHT schemes, only CAN and Koorde 
definitely have constant degree. CAN is of degree 2d, but its diameter is O(dN^(1/d)), and 
so it does not scale as well as FissionE. Koorde [4] has constant degree and O(log N) 
diameter, but it is not (1+o(1))-congestion-free and its congestion is more severe than that 
in FissionE. 

The remainder of the paper is organized as follows. Section 2 introduces the Kautz 
graph and its properties. Section 3 proves the low congestion property of the Kautz 
graph. Section 4 describes the design of FissionE. Conclusions and future work are 
discussed in Section 5. 



2 Static Kautz Graph 

Many DHT schemes are based on traditional interconnection network topologies. 
Unlike a dynamic P2P network, a traditional interconnection network limits 
the number of nodes it can support and does not support the dynamic 
joining or departure of nodes. To distinguish the two, the traditional interconnection 
network is called the static network in this paper. FissionE exploits the Kautz graph as its 
static topology. This section discusses the Kautz graph and its properties. 

Definition 1. The Kautz string $\xi$ of length $n$ and base $d$ is defined as a string $a_1 a_2 \ldots a_n$ 
where $a_i \in \{0,1,2,\ldots,d\}$ and $a_i \neq a_{i+1}$ ($1 \le i \le n-1$). 

Definition 2. The Kautz namespace KautzSpace(d,k) is defined as the set containing 
all the Kautz strings of length k and base d, i.e., 

$$\mathrm{KautzSpace}(d,k) = \{\, a_1 a_2 \ldots a_k \mid a_i \in \{0,1,2,\ldots,d\}\ (1 \le i \le k) \text{ and } a_i \neq a_{i+1}\ (1 \le i \le k-1) \,\}.$$

Definition 3. The Kautz graph K(d,k) [9] is a directed graph whose nodes are labeled 
with the Kautz strings of length k and base d. For simplicity, we name a node by its 
label. Every node $U = u_1 u_2 \ldots u_k$ in Kautz graph K(d,k) has d outgoing edges: for each 
$a \in \{0,1,2,\ldots,d\}$ with $a \neq u_k$, node U has one outgoing edge to node $V = u_2 u_3 \ldots u_k a$ 
(denoted by $U \to V$), i.e., there is an edge from U to V iff V is a left-shifted version of U. 




Obviously there are $N = d^k + d^{k-1}$ nodes in the K(d,k) graph and each node in K(d,k) is 
of in-degree d and out-degree d. Figure 1 shows Kautz graph K(2,3). 
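
The definitions above translate directly into a few lines of code. The sketch below enumerates KautzSpace(d,k) and builds the adjacency lists of K(d,k); the function names are illustrative, and symbols are encoded as single characters, so it assumes d is at most 9.

```python
from itertools import product
from typing import Dict, List


def kautz_space(d: int, k: int) -> List[str]:
    """All Kautz strings of length k over the alphabet {0, ..., d}."""
    symbols = [str(s) for s in range(d + 1)]  # one character per symbol
    return ["".join(w) for w in product(symbols, repeat=k)
            if all(a != b for a, b in zip(w, w[1:]))]


def kautz_graph(d: int, k: int) -> Dict[str, List[str]]:
    """Adjacency lists of K(d,k): U -> left shift of U plus a new symbol."""
    symbols = [str(s) for s in range(d + 1)]
    return {u: [u[1:] + a for a in symbols if a != u[-1]]
            for u in kautz_space(d, k)}
```

For K(2,3) this yields 12 nodes, each with two outgoing and two incoming edges, in line with the node count and degrees stated above.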



Table 1. The degree-diameter tradeoff of different topologies 

Topology            Degree    Diameter                       Average path length 
de Bruijn           d         log_d N                        log_d N - 1/(d-1) [11] 
Hypercube (Chord)   log_2 N   log_2 N                        1/2 log_2 N 
d-torus (CAN)       2d        1/2 d N^(1/d)                  1/4 d N^(1/d) 
Butterfly           d         2 log_d N (1-o(1))             about 3/2 log_d N [11] 
Kautz (FissionE)    d         D = log_d N - log_d(1+1/d)     D - 1/(d+1) 


Assuming a graph of fixed degree d and diameter k, the maximum number of 
nodes N in the graph is the Moore bound [10] $1 + d + d^2 + \cdots + d^k$. The Moore bound is not 
achievable for any non-trivial graph. The number of nodes in the Kautz graph K(d,k) 
is $d^k + d^{k-1}$, very close to the Moore bound. In fact, the Kautz graph is the densest graph 
when the diameter is two. From the Moore bound, it is easy to see that the lower bound of 
the diameter of a graph with N nodes is $\lceil \log_d (N(d-1)+1) \rceil - 1$, and the diameter k 
of Kautz graph K(d,k) reaches this lower bound, as 

$$\lceil \log_d (N(d-1)+1) \rceil - 1 = \lceil \log_d ((d^k + d^{k-1})(d-1) + 1) \rceil - 1 = \lceil \log_d (d^{k+1} - d^{k-1} + 1) \rceil - 1 = k.$$

Thus Kautz graph K(d,k) has an optimal diameter. 
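
The optimal-diameter claim can be checked numerically with integer arithmetic, which avoids floating-point logarithms. In the sketch below, `moore_bound` and `diameter_lower_bound` are illustrative names; the lower bound is taken as the smallest diameter D whose Moore bound is at least N, which is equivalent to the ceiling expression above for d of at least 2.

```python
def moore_bound(d: int, diameter: int) -> int:
    """Maximum number of nodes in a graph of degree d and given diameter."""
    return sum(d ** i for i in range(diameter + 1))


def diameter_lower_bound(d: int, n: int) -> int:
    """Smallest diameter D such that the Moore bound admits n nodes."""
    diameter = 0
    while moore_bound(d, diameter) < n:
        diameter += 1
    return diameter


def kautz_nodes(d: int, k: int) -> int:
    """Number of nodes of the Kautz graph K(d,k)."""
    return d ** k + d ** (k - 1)
```

For every d of at least 2 and k of at least 1, `diameter_lower_bound(d, kautz_nodes(d, k))` equals k, which is the diameter of K(d,k).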

The Kautz graph also has optimal fault tolerance [13]. That is, Kautz graph K(d,k) 
of degree d is d-connected (i.e., there are d node-disjoint paths between any two 
nodes). The corresponding de Bruijn graph is (d-1)-connected. In addition, Kautz 
graph K(d,k) has a better load balancing feature than the de Bruijn graph, as shown in 
[9]. Table 1 shows the degree-diameter tradeoff of different topologies. 



3 Low Congestion Routing in Kautz Graph 

There are many routing algorithms for the Kautz graph. FissionE uses the long path 
routing algorithm [9]. Long path routing in the Kautz graph from node U 
to node V is accomplished by taking the string U and shifting in the symbols of V one 
at a time until the string U has been replaced by V. For instance, given two nodes 
$U = u_1 u_2 \ldots u_k$ and $V = v_1 v_2 \ldots v_k$, the long routing path from U to V is a path of length k 
shown below: 

$$U = u_1 u_2 \ldots u_k \to u_2 u_3 \ldots u_k v_1 \to u_3 u_4 \ldots u_k v_1 v_2 \to \cdots \to u_k v_1 v_2 \ldots v_{k-1} \to v_1 v_2 \ldots v_k \quad (\text{if } u_k \neq v_1)$$

or a path of length k-1 shown below: 

$$U = u_1 u_2 \ldots u_k \to u_2 u_3 \ldots u_k v_2 \to u_3 u_4 \ldots u_k v_2 v_3 \to \cdots \to u_k v_2 \ldots v_{k-1} v_k = v_1 v_2 \ldots v_k \quad (\text{if } u_k = v_1)$$

For example, with the long path routing algorithm, the routing path in Kautz graph 
K(2,3) from node 012 to node 102 is 012 → 121 → 210 → 102, and the routing path 
from 012 to 202 is 012 → 120 → 202. 

The long path may contain duplicate nodes; the algorithm keeps them for 
symmetry and simplicity. Obviously, with the long path routing algorithm, the path 
length between any two nodes is k or k-1, and the average path length is 
$h = \frac{d}{d+1} k + \frac{1}{d+1} (k-1) = k - \frac{1}{d+1}$. Compared with the shortest path routing 
algorithm, the long path routing algorithm has a slightly longer average routing path 
length, but it has better load balance characteristics, and its average delay is even 
less than that of the shortest path routing algorithm under heavy load [9] (the severe congestion on 
some nodes leads to some delay). FissionE adopts the long path routing algorithm. 

Now we consider the congestion characteristic of long path routing in Kautz graph. 
We use the concept “congestion-free” from [7]. 

Definition 4 [7]. A P2P network is c-congestion-free (c is a constant and c ≥ 1) if its 
static network is both c-node-congestion-free and c-edge-congestion-free under 
uniform all-to-all communication load. Being c-congestion-free is also called having constant 
congestion. A network is said to be c-node-congestion-free if no node is handling 
more than c times the average traffic per node. A network is said to be 
c-edge-congestion-free if no edge is handling more than c times the average traffic per edge. 
The uniform all-to-all communication load is defined as follows: for each pair of nodes U, V 
(U ≠ V), there is a unit of traffic from U to V. The static P2P network refers to the 
case in which all nodes in the identification space exist and are alive, i.e., the nodes in the P2P 
network form the complete static topology. 

Now we turn to the congestion property of the Kautz graph. The lemmas 
referred to in the proof are given after Theorem 1. 

Theorem 1. When using the long path routing algorithm, Kautz graph K(d,k) is 
(1+o(1))-congestion-free. 

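Proof. Define 

$$S_1 = \{\, u_1 u_2 \ldots u_k u_1 u_2 \ldots u_k \mid u_1 u_2 \ldots u_k u_1 u_2 \ldots u_k \in \mathrm{KautzSpace}(d,2k) \,\},$$
$$S_2 = \{\, u_1 u_2 \ldots u_k u_2 \ldots u_k \mid u_1 u_2 \ldots u_k u_2 \ldots u_k \in \mathrm{KautzSpace}(d,2k-1) \text{ and } u_1 u_2 \ldots u_k = u_k u_2 \ldots u_k \,\},$$
$$S_3 = \mathrm{KautzSpace}(d,2k) - S_1, \quad S_4 = \mathrm{KautzSpace}(d,2k-1) - S_2, \quad S = S_3 \cup S_4.$$

The uniform all-to-all communication load is represented by the set M: 

$$M = \{\, \text{routing paths from } U \text{ to } V \mid U, V \text{ are nodes in } K(d,k) \text{ and } U \neq V \,\}.$$

Define the mapping f as follows: for every $\delta \in M$, assuming $\delta$ is a routing path of length n, 

$$b_1 b_2 \ldots b_k \to b_2 b_3 \ldots b_{k+1} \to b_3 b_4 \ldots b_{k+2} \to \cdots \to b_{n+1} b_{n+2} \ldots b_{n+k},$$

then $f(\delta) = b_1 b_2 \ldots b_k \ldots b_{n+k}$. 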

From Lemma 1, f is a bijection from M to S. Thus under uniform all-to-all 
communication load, for any node $R = r_1 r_2 \ldots r_k$, its load equals the number of times the 
Kautz string $r_1 r_2 \ldots r_k$ appears as a substring (except as the prefix) of the Kautz strings 
in S. From Lemma 2, the load $L_R$ of R is: 

$$L_R = \begin{cases} k d^k + (k-1) d^{k-1} - k & (r_1 \neq r_k) \\ k d^k + (k-1) d^{k-1} - k + 1 & (r_1 = r_k) \end{cases}$$

The average path length in Kautz graph K(d,k) is $h = k - 1/(d+1)$, thus the average 
load of a node is 

$$\mathrm{Avg}(L_R) = (N-1) h = (d^k + d^{k-1} - 1)\left(k - \frac{1}{d+1}\right) = k d^k + (k-1) d^{k-1} - k + \frac{1}{d+1}.$$

Because $\mathrm{Max}(L_R) - \mathrm{Avg}(L_R) = d/(d+1) \ll \mathrm{Avg}(L_R)$, and 

$$\frac{\mathrm{Max}(L_R)}{\mathrm{Avg}(L_R)} < 1 + \frac{1}{(k-1)(d^k + d^{k-1})} = 1 + \frac{1}{(k-1) N} = 1 + O\!\left(\frac{1}{N \log_d N}\right) = 1 + o(1),$$

the static Kautz graph is (1+o(1))-node-congestion-free. 

In Kautz graph K(d,k), the edge from $r_1 r_2 \ldots r_k$ to $r_2 \ldots r_k r_{k+1}$ can be uniquely 
represented by the Kautz string $r_1 r_2 \ldots r_k r_{k+1}$, and each Kautz string $b_1 b_2 \ldots b_k b_{k+1}$ in 
the Kautz namespace KautzSpace(d,k+1) can be uniquely represented by the edge from node 
$b_1 b_2 \ldots b_k$ to node $b_2 \ldots b_k b_{k+1}$. Thus under uniform all-to-all communication load, for 
any edge $e = r_1 r_2 \ldots r_k r_{k+1}$ in K(d,k), its load equals the number of times the Kautz string 
$r_1 r_2 \ldots r_k r_{k+1}$ appears as a substring of the Kautz strings in S. From Lemma 3, the load 
$L_e$ of e is: 

$$L_e = \begin{cases} k d^{k-1} + (k-1) d^{k-2} - k & (r_1 = r_{k+1}) \\ k d^{k-1} + (k-1) d^{k-2} - k + 1 & (r_1 = r_k \text{ and } r_2 = r_{k+1}) \\ k d^{k-1} + (k-1) d^{k-2} & (\text{others}) \end{cases}$$

In Kautz graph K(d,k), the average load of edges is $\mathrm{Avg}(L_e) = N (N-1) h / |E|$ (|E| is the 
number of edges in K(d,k) and $|E| = N d$), thus 

$$\mathrm{Avg}(L_e) = \frac{N (N-1) h}{N d} = \frac{(N-1) h}{d} = k d^{k-1} + (k-1) d^{k-2} - \frac{k}{d} + \frac{1}{d (d+1)}.$$

Because $\mathrm{Max}(L_e) - \mathrm{Avg}(L_e) = k/d - 1/(d(d+1)) = h/d \ll \mathrm{Avg}(L_e)$, and 

$$\frac{\mathrm{Max}(L_e)}{\mathrm{Avg}(L_e)} = 1 + \frac{h/d}{(N-1) h / d} = 1 + \frac{1}{N-1} = 1 + o(1),$$

Kautz graph K(d,k) is (1+o(1))-edge-congestion-free. 

Therefore, Kautz graph K(d,k) is (1+o(1))-congestion-free. □ 

From Theorem 1, it follows that the Kautz graph has constant congestion. 

Now we give the lemmas referred to in the proof above. 

Lemma 1. The mapping f is a bijection from M to S. 

Proof. Obviously, $S_3 \cap S_4 = \emptyset$ and $S = \mathrm{KautzSpace}(d,2k) \cup \mathrm{KautzSpace}(d,2k-1) - S_1 - S_2$. 

First we prove that f is an injection. 

Every $\delta \in M$ is a routing path from a certain node $U = u_1 u_2 \ldots u_k$ to another node 
$V = v_1 v_2 \ldots v_k$ ($U \neq V$). 

1) If $u_k \neq v_1$, then the routing path $\delta$ is: 

$$U = u_1 u_2 \ldots u_k \to u_2 u_3 \ldots u_k v_1 \to u_3 u_4 \ldots u_k v_1 v_2 \to \cdots \to u_k v_1 v_2 \ldots v_{k-1} \to v_1 v_2 \ldots v_k = V$$

Thus $f(\delta) = u_1 u_2 \ldots u_k v_1 v_2 \ldots v_k$, and therefore $f(\delta) \in \mathrm{KautzSpace}(d,2k)$. Since $U \neq V$, 
$f(\delta) \notin S_1$, thus $f(\delta) \in \mathrm{KautzSpace}(d,2k) - S_1$, i.e., $f(\delta) \in S_3$. 

2) If $u_k = v_1$, then the routing path $\delta$ is: 

$$U = u_1 u_2 \ldots u_k \to u_2 u_3 \ldots u_k v_2 \to u_3 u_4 \ldots u_k v_2 v_3 \to \cdots \to u_k v_2 \ldots v_{k-1} v_k = v_1 v_2 \ldots v_k = V$$

Thus $f(\delta) = u_1 u_2 \ldots u_k v_2 \ldots v_k$, and therefore $f(\delta) \in \mathrm{KautzSpace}(d,2k-1)$. Since $U \neq V$, i.e., 
$u_1 u_2 \ldots u_k \neq u_k v_2 \ldots v_k$, $f(\delta) \notin S_2$, and $f(\delta) \in \mathrm{KautzSpace}(d,2k-1) - S_2$, i.e., 
$f(\delta) \in S_4$. 

So for every $\delta \in M$, $f(\delta) \in S$; that is, the range of f is contained in S, and f is a mapping from M 
to S. Obviously, the identical routing path can only be mapped to one Kautz string, 
and different routing paths are mapped to different Kautz strings; thus f is an 
injection. 

Next we prove that f is a surjection. 

For every $\xi \in S$, since $S_3 \cap S_4 = \emptyset$, either $\xi \in S_3$ or $\xi \in S_4$. 

If $\xi \in S_3$, let $\xi = a_1 a_2 \ldots a_{2k-1} a_{2k}$. According to the definition of $S_3$, $a_1 a_2 \ldots a_k$ and 
$a_{k+1} a_{k+2} \ldots a_{2k}$ are both valid Kautz strings in KautzSpace(d,k), and 
$a_1 a_2 \ldots a_k \neq a_{k+1} a_{k+2} \ldots a_{2k}$, $a_{k+1} \neq a_k$. Consider the route $\delta'$ in set M with length k which 
originates from source node $a_1 a_2 \ldots a_k$ and ends at destination node $a_{k+1} a_{k+2} \ldots a_{2k}$: 

$$a_1 a_2 \ldots a_k \to a_2 a_3 \ldots a_k a_{k+1} \to a_3 a_4 \ldots a_{k+2} \to \cdots \to a_k \ldots a_{2k-1} \to a_{k+1} \ldots a_{2k}.$$

We find that $\xi = f(\delta')$, i.e., there exists $\delta' \in M$ such that $\xi = f(\delta')$. 

If $\xi \in S_4$, let $\xi = b_1 b_2 \ldots b_{2k-1}$. According to the definition of $S_4$, 
$b_1 b_2 \ldots b_k$ and $b_k b_{k+1} \ldots b_{2k-1}$ are both valid Kautz strings in KautzSpace(d,k); moreover, 
$b_1 b_2 \ldots b_k \neq b_k b_{k+1} \ldots b_{2k-1}$. Consider the route $\delta'$ with length k-1 in set M which originates 
from source node $b_1 b_2 \ldots b_k$ and ends at target node $b_k b_{k+1} \ldots b_{2k-1}$: 

$$b_1 b_2 \ldots b_k \to b_2 b_3 \ldots b_k b_{k+1} \to b_3 b_4 \ldots b_{k+2} \to \cdots \to b_{k-1} b_k \ldots b_{2k-2} \to b_k b_{k+1} \ldots b_{2k-1}.$$

Thus $\xi = f(\delta')$, that is, there exists $\delta' \in M$ such that $\xi = f(\delta')$. 

Thus f is a bijection. □ 

Lemma 2 For any Kautz string R=rjr2...rk in KautzSpace(d,k), the number of R 
appearing as the substring(except for the prefix) of Kautz strings in set S is: 

^ _\k*d^ +{k-\)d'‘-' -k 

\k * d'‘ + {k - \)d - A: -H 1 (fi = 'T ) 

Proof S= KautzSpace(d, 2 k) U KautzSpace(d, 2 k- 1 )- S 1 - S 2 . 

Based on theories of combinatorics, the number of times that R appears as a 
substring (except for the prefix) of Kautz string in KautzSpace{d, 2 k) are k*ct. (R can 
be placed in k different places in a Kautz string with length 2 k, and the other k places 
left all have d choices.). Similarly, the number of times that R appears as a substring 
(except for the prefix) of Kautz string in KautzSpace{d, 2 k-\) is {k-\)*d'''^. 

Then we calculate the number of times that R appears in 81 and 82 . If R appears as 
a substring of the Kautz string ^ in Sl={ujU2..MkUjU2...Uk \ UiU2..MkUjU2...Uk G 
KautzSpace(d, 2 k) } and R appears at No. m place of assuming 
ip=bib2...bifk+ibk+2-b2k=bib2...bifib2...bk, then \]^b„...bif,b2-b„.i {bj^b„,.,), i.e., 
rifrk . Similarly, if R appears in S 2 , ri=rk. 

Thus if $r_1 \neq r_k$, R would not appear in $S_2$. For each m that satisfies $1 < m \le k$, we could
construct a unique Kautz string $\xi = r_{k-m+2}\ldots r_kr_1r_2\ldots r_kr_1r_2\ldots r_{k-m+1}$ with length 2k: $\xi \in S_1$,
R appears at the No. m place of $\xi$, and R also appears at the No. k+1 place of $r_1r_2\ldots r_kr_1r_2\ldots r_k$,
which is in $S_1$. Therefore, the number of times that R appears in $S_1$ is k and the number
of times that R appears in $S_2$ is 0.

If $r_1 = r_k$, for each m that satisfies $1 < m \le k$, we could construct a unique Kautz string
$\xi = r_{k-m+1}\ldots r_kr_2\ldots r_kr_2\ldots r_{k-m}r_{k-m+1}$ with length 2k-1: $\xi \in S_2$ and R appears at the No. m place
of $\xi$. Therefore, the number of times that R appears in $S_2$ is k-1 and the number of
times that R appears in $S_1$ is 0.

Therefore, for any node $R = r_1r_2\ldots r_k$ in Kautz graph K(d,k):

If $r_1 \neq r_k$, $L_R = k \cdot d^k + (k-1) \cdot d^{k-1} - k$.

If $r_1 = r_k$, $L_R = k \cdot d^k + (k-1) \cdot d^{k-1} - (k-1)$. □

Lemma 3 For each Kautz string $e = r_1r_2\ldots r_kr_{k+1}$ in KautzSpace(d,k+1), the number of
times that e appears as a substring of Kautz strings in set S is:

$$L_e = \begin{cases} k d^{k-1} + (k-1) d^{k-2} - k & (r_1 = r_{k+1}) \\ k d^{k-1} + (k-1) d^{k-2} - k + 1 & (r_1 = r_k,\ r_2 = r_{k+1}) \\ k d^{k-1} + (k-1) d^{k-2} & (\text{else}) \end{cases}$$

The proof of Lemma 3 is similar to that of Lemma 2 and is omitted here.



4 FissionE Sketch

The Kautz graph has optimal diameter and good fault tolerance characteristics. It also
has constant congestion when the long path routing algorithm is used. Thus the Kautz graph
is a good static topology on which to construct DHT schemes. We propose FissionE, a novel
constant-degree, O(log N)-diameter and (1+o(1))-congestion-free DHT scheme based
on the Kautz graph.

FissionE adopts the Kautz graph K(2,k) as its static topology. Each peer in FissionE
owns a zone in a virtual 2-dimensional Cartesian coordinate space. The identifiers of zones in
FissionE are Kautz strings with base 2, and zones are organized according to their
identifiers. The identifier of a peer is the identifier of the zone it owns. When peers
join or leave, the "split large and merge small" policy is adopted for maintenance.
The entire coordinate space is thus dynamically partitioned among the peers in the
system, and the identifiers of zones change dynamically.

FissionE is somewhat similar to the Fission scheme [8]; the main differences lie in
the neighborhood of peers, the routing algorithm and the update algorithms.
In FissionE, the neighborhood invariant (i.e., if zones U and V are neighbors, then
$||U| - |V|| \le 1$) is kept, but there are no brother-edges. An example of a FissionE
neighborhood is shown in Figure 2. The routing algorithm in FissionE is much like the
long path routing algorithm in the Kautz graph, while Fission adopts the short path
routing algorithm. The maintenance policy is similar to that in Fission, but the
procedure to find a fit zone to split or merge is much more complex. Some fault-tolerant
mechanisms are also proposed in FissionE. The details of FissionE and Fission
are in [8, 14].

Now we show some properties of FissionE. The proofs and details are in [14].

Theorem 2 (Congestion Characteristic) FissionE is (1+o(1))-congestion-free.

Theorem 3 (Performance Characteristic) In an N-peer FissionE system,

1. The in-degree of each peer is 2 and the out-degree is between 1 and 4. The
average out-degree is 2.

2. The diameter of FissionE systems is less than $2\log_2 N$.
5 Conclusions and Future Work

The Kautz graph is a good static topology on which to construct DHT schemes. A novel DHT
scheme based on the Kautz graph, FissionE, is proposed to achieve constant degree,
O(log N) diameter and (1+o(1))-congestion-freedom. FissionE is a very promising DHT
scheme, and many topics on FissionE (such as proximity, heterogeneity, etc.) will be
investigated thoroughly in our further work.
Abstract. Based on an analysis of the multiple packet losses that standard slow start
causes through exponential growth of the congestion window (cwnd), this paper
proposes a new phase-divided TCP start scheme and designs a parameterized
model to reduce packet losses and improve TCP performance. The scheme
employs different cwnd growth rules when cwnd is below and above half of the
window threshold (ssthresh), namely exponential growth and negatively
exponential growth respectively. This greatly decreases the probability of
multiple packet losses from a window of data and guarantees that a connection
smoothly joins the Internet and transitions into congestion avoidance.
The parameterized model adjusts the duration of slow start and the acceleration of
cwnd growth to improve the performance of the slow start phase through various
parameter settings. An adaptive parameter setting method is designed. The
simulation results show that this new method significantly decreases packet
losses and improves the stability of TCP and the performance of slow start, and
also achieves good fairness and friendliness to other TCP connections.



1 Introduction 

With the expansion of Internet applications and scale, TCP is widely deployed,
providing reliable end-to-end Internet services. Statistical data show that 95 percent of
data flows on the Internet belong to TCP [1]. Since TCP was introduced, researchers
have done much work on it and proposed several enhanced variants [2,3,4]. TCP
adopts a sliding window mechanism to control network congestion, and slow start takes
effect when a session starts: the sender opens a congestion window of one segment
and exponentially increases cwnd until cwnd reaches a threshold, ssthresh. Slow start
therefore effectively avoids bursty traffic when new connections join the network.

Connections are classified into long-lived and short-lived connections according to
their duration. Network measurement [1] shows that short-lived flows, such as WEB
and TELNET, are the majority, while long-lived flows, like FTP, are the minority but
transfer most of the data packets of TCP flows. Short-lived flows often end before they
reach steady state; namely, they often remain in the startup phase. Obviously, the
performance of slow start directly affects the transmission utility of short-lived
flows. In practice, short-lived flows convey the data of important Internet services,
which require high bandwidth and short delay. In the same way, slow start also takes
effect when long-lived flows start and when a retransmission timeout takes place, and thus
affects the performance of long-lived flows. Although the duration of slow start is very
short, it is of great significance for improving the performance of data transmission. The
congestion window of standard slow start doubles every RTT, as shown in figure 2 and
figure 3, and follows equation (1):

$$cwnd(t+T) = \begin{cases} 2 \times cwnd(t), & \text{if } cwnd(t) < ssthresh \\ ssthresh, & \text{if } cwnd(t) \ge ssthresh \end{cases} \qquad (1)$$



Researchers have proposed many enhanced schemes to improve slow start. M.
Allman [5] sets a larger initial window (LIW) to improve performance. TCP
Fast-start [6] records recent network parameters to reduce the start time
of a new connection, decrease short abrupt transmission delay, and maintain high
utility while the network stays steady. SPAND [7] collects the current network state and
derives optimal initial parameters. J. Hoe's method replaces the default setting of
ssthresh with an estimated value to ensure that cwnd reaches an appropriate value [8].
Using recent history information to initialize the parameters of new connections has
been presented in [9]. TCP Vegas [10] restricts the exponential window growth and
doubles cwnd every other RTT. The above approaches partially optimize slow start,
but each still has weaknesses. TCP Fast-start strictly requires a steady network. J.
Hoe's method is problematic in practice. Parameter setting based on history information
violates the slow start principle and cannot follow dynamic changes of the network. TCP Vegas
cannot avoid multiple packet losses in one window.

Considering these limitations, we propose a new phase-divided and gradually
approaching slow start algorithm, called P-Start. P-Start employs the standard slow start
mechanism, and cwnd grows exponentially, while the congestion window is less than
ssthresh/2. If cwnd is equal to or greater than ssthresh/2, the next congestion window,
$cwnd_{i+1}$, is not directly set to ssthresh but only increases by $(ssthresh - cwnd_i)/2$, and this iterates
until $(ssthresh - cwnd_i)$ is less than the factor δ, which lies between 2 and ssthresh/2. This
approach combines exponential and negatively exponential growth, so that the
congestion window gradually approaches ssthresh, to improve the stability of TCP and
decrease multiple packet losses. The rest of the paper is organized as follows. Section
2 proposes the phase-divided and gradually approaching slow start algorithm and the
parameterized model; Section 3 validates the algorithm through a series of simulation
experiments; Section 4 gives the conclusion and points out further research directions.



2 P-Start 

2.1 Motivation of P-Start 

TCP slow start probes network bandwidth, and its inherent property of exponential
window growth from one packet results in the following problems. First, the congestion
window starts with one packet and needs many RTTs to grow, which brings low utility for
short-lived connections. Researchers therefore propose a larger initial window and adopt
fast start [5,6] to reduce the duration of slow start and improve network utility.
Second, the source nodes are blind to the available bandwidth and use the default
initial ssthresh. The exponential growth of the congestion window results in severe
overflow of the buffer of the bottleneck link and multiple packet losses, which causes
TCP to lose its self-clocking and suffer retransmission timeouts. Retransmission timeouts in turn cause
global synchronization, which greatly degrades network utility and brings oscillation
of queuing delay. TCP must then restart to regain self-clocking.

Efficient measurement technology, which is used to probe available network
bandwidth, adaptively set the slow start ssthresh, and eliminate the limitation of static
parameter setting, is a potential method. However, because of the dynamics of data flows
and the delay of system feedback, even effective bandwidth measurement cannot achieve an exact
match between ssthresh and available bandwidth. In the last RTT, the send window
increases by nearly ssthresh/2, the largest increment, even though the sending rate is already close to
the network's capacity. This excessive window increment causes multiple packet losses.





Fig. 1. Cwnd acceleration 



Fig. 2. Comparison of startup mechanisms 



The window growth of standard slow start is shown as curve a in figure 1; the
increment of the window becomes larger and larger. In fact, when a connection starts, it
should gradually increase the congestion window from a small value, avoiding a large initial
window that brings bursty traffic and causes network congestion. When the sending
rate is near network capacity, the increment should be reduced so as to transition smoothly
into congestion avoidance, which is shown as curve b in figure 1.



2.2 Elements of P-Start 

The key idea of P-Start is that the congestion window increases exponentially while it is
less than ssthresh/2; otherwise it increases by (ssthresh-cwnd)/2 and gradually approaches
ssthresh until (ssthresh-cwnd) is less than the factor δ (ssthresh/2 ≥ δ ≥ 2), and
then cwnd is set to ssthresh and TCP enters the congestion avoidance phase. Compared
with TCP Vegas, P-Start has the same duration when the factor is set to 2, its
lower bound; that is, the longest P-Start process lasts as long as that of TCP
Vegas. However, P-Start can efficiently decrease the probability of multiple packet losses, because
its cwnd increment becomes smaller and the largest increments are located in the middle phase of startup.
cwnd can be represented as (2), as shown in figure 2 and figure 4.

$$cwnd(t+T) = \begin{cases} 2 \times cwnd(t), & \text{if } cwnd < ssthresh/2 \\ (cwnd(t) + ssthresh)/2, & \text{if } ssthresh - cwnd \ge \delta \text{ and } ssthresh > cwnd \ge ssthresh/2 \\ ssthresh, & \text{else} \end{cases} \qquad (2)$$

The major feature of P-Start is that cwnd increases with a small amplitude both at the start
phase and during the transition to the congestion avoidance phase. The sending rate changes
smoothly, with little impact on other connections sharing the link, which maintains network
stability and decreases oscillation. The algorithm is as follows:

1. init: cwnd=1, ssthresh=initial value, reset δ;

2. send cwnd packets;

3. if cwnd < ssthresh/2, then cwnd=2*cwnd, goto 2;

4. if cwnd ≥ ssthresh/2 and ssthresh-cwnd ≥ δ, then
cwnd=(cwnd+ssthresh)/2, goto 2;

5. else cwnd=ssthresh, goto 7;

6. if received 3 dup-ACKs or retransmission timeout,
then re-enter slow start phase, goto 1;

7. enter congestion avoidance phase;
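
The window update of the steps above can be sketched in a few lines. This is only an illustration under the same idealized assumptions used in the analysis (every segment is acked, no losses, fixed RTT); the function name `p_start_windows` is hypothetical, and the approach phase is assumed to continue while ssthresh - cwnd is at least δ.

```python
def p_start_windows(ssthresh, delta=2, cwnd=1):
    """Return the list of cwnd values, one per RTT, from the initial cwnd
    until cwnd reaches ssthresh (loss-free, fixed-RTT idealization)."""
    windows = [cwnd]
    while cwnd < ssthresh:
        if cwnd < ssthresh / 2:
            # Exponential phase: same as standard slow start
            cwnd = 2 * cwnd
        elif ssthresh - cwnd >= delta:
            # Approach phase: close half of the remaining gap to ssthresh
            cwnd = (cwnd + ssthresh) / 2
        else:
            # Gap smaller than delta: jump to ssthresh and stop
            cwnd = ssthresh
        windows.append(cwnd)
    return windows


if __name__ == "__main__":
    print(p_start_windows(64, delta=2))
```

With ssthresh = 64 and δ = 2 the window doubles up to 32 and then closes half of the remaining gap each RTT, so the per-RTT increments shrink as cwnd nears ssthresh.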

We compare the window change of standard slow start [12] and P-Start.
First, we assume that all segments are successfully acked and that the round trip time is fixed.

Let $N + 1 = \log_2 ssthresh$, so the duration is equal to $(N + 1) \times RTT$ according to
the slow start rules. The window increments are shown in figure 3. In the first N-1 RTTs,
the total increment is ssthresh/4. In the last two RTTs, cwnd increases by ssthresh/4 and
ssthresh/2 respectively, so the total increment of the last two RTTs reaches three quarters of
ssthresh. In a high-speed network, the congestion window is very large, and the bursty
traffic caused by such a great increment of cwnd has a great impact on network
stability: packet loss for all connections sharing the bottleneck link, global
synchronization and degraded performance. P-Start is shown in figure 4. When
δ = 2, the largest increment of cwnd is equal to ssthresh/4 and occurs twice during slow
start, so bursty traffic is reduced by 50%. The congestion window changes smoothly, but the
utility of P-Start is lower than that of standard slow start, as shown in figure 2. However,
P-Start can maintain higher network utility when network congestion occurs, by
decreasing packet losses. P-Start is a generalized slow start, which can be applied in wider
areas by varying δ from 2 to ssthresh/2. If δ = ssthresh/2, P-Start turns into
standard slow start. P-Start reduces its duration and improves utility if δ is set to a
value greater than 2.
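
The increment comparison can be reproduced numerically. The sketch below assumes loss-free operation with a fixed RTT and an ssthresh that is a power of two; `standard_slow_start`, `p_start` and `increments` are hypothetical helper names used only for this illustration.

```python
def standard_slow_start(ssthresh, cwnd=1):
    """cwnd per RTT for standard slow start: double until ssthresh is reached."""
    windows = [cwnd]
    while cwnd < ssthresh:
        cwnd = min(2 * cwnd, ssthresh)
        windows.append(cwnd)
    return windows


def p_start(ssthresh, delta=2, cwnd=1):
    """cwnd per RTT for P-Start: double below ssthresh/2, then halve the gap."""
    windows = [cwnd]
    while cwnd < ssthresh:
        if cwnd < ssthresh / 2:
            cwnd = 2 * cwnd
        elif ssthresh - cwnd >= delta:
            cwnd = (cwnd + ssthresh) / 2
        else:
            cwnd = ssthresh
        windows.append(cwnd)
    return windows


def increments(windows):
    """Per-RTT window increments."""
    return [b - a for a, b in zip(windows, windows[1:])]


if __name__ == "__main__":
    ssthresh = 64
    std = increments(standard_slow_start(ssthresh))
    ps = increments(p_start(ssthresh, delta=2))
    print("standard:", std, "largest:", max(std))
    print("P-Start: ", ps, "largest:", max(ps), "times:", ps.count(max(ps)))
```

For ssthresh = 64 the last two standard increments are 16 and 32 (three quarters of ssthresh in total), while the largest P-Start increment is 16 and appears twice.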
Fig. 3. Window change of slow start of TCP



Fig. 4. Window change of P-Start (δ = 2)



2.3 Flexible Parameter Setting Model 

According to section 2.2, if δ = 2, P-Start has the same duration as
TCP Vegas; however, P-Start reduces the granularity of the cwnd increment and
significantly improves the smoothness of the sending rate change, as shown in figure 2.
If δ = ssthresh/2, P-Start becomes standard slow start. P-Start can flexibly select the
value of δ between 2 and ssthresh/2. Compared with standard slow start, the behavior is the
same while cwnd < ssthresh/2, namely exponential window growth; above that, the difference
between ssthresh and the congestion window of P-Start decays exponentially (negatively
exponential growth), as represented in (2). The duration is approximately double that of standard slow start,
which causes low performance in the startup phase, as shown in figure 2. So it is necessary
to improve the performance while keeping a smooth transition to congestion
avoidance. An increment factor γ (γ ≥ 1) is introduced, which means that cwnd
increases by γ packets for each packet that is successfully acked. cwnd thus becomes (γ+1) times
its former value during one RTT if all packets are successfully acked, so the factor
γ represents the acceleration of cwnd growth. For example, if γ = 1 and δ = 2, P-Start is
the algorithm stated in section 2.2. If γ = 3, δ = 2 and cwnd < ssthresh/2, cwnd becomes
4 times its original value; else if cwnd ≥ ssthresh/2, cwnd increases by
γ(ssthresh-cwnd)/(γ+1) every RTT, namely 3(ssthresh-cwnd)/4, until (ssthresh-cwnd) < δ, as shown in
figure 2. So cwnd of P-Start can be computed as (3):

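$$cwnd(t+T) = \begin{cases} (\gamma + 1) \times cwnd(t), & \text{if } cwnd < ssthresh/2 \\ \dfrac{\gamma \times ssthresh + cwnd(t)}{\gamma + 1}, & \text{if } cwnd \ge ssthresh/2 \text{ and } ssthresh - cwnd \ge \delta \\ ssthresh, & \text{if } cwnd \ge ssthresh \text{ or } ssthresh - cwnd < \delta \end{cases} \qquad (3)$$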
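
A minimal sketch of the parameterized update with increment factor γ and threshold factor δ follows. It assumes loss-free operation with a fixed RTT. The names `p_start_step` and `p_start_trajectory` are hypothetical, and capping the exponential phase at ssthresh is an added assumption so that the window never overshoots when (γ+1)·cwnd would exceed ssthresh.

```python
def p_start_step(cwnd, ssthresh, gamma=1, delta=2):
    """One per-RTT update of cwnd under the parameterized P-Start rule."""
    if cwnd >= ssthresh or ssthresh - cwnd < delta:
        # Close enough to the threshold: settle on ssthresh
        return ssthresh
    if cwnd < ssthresh / 2:
        # Exponential phase; the min() cap is an assumption of this sketch
        return min((gamma + 1) * cwnd, ssthresh)
    # Approach phase: the gap to ssthresh shrinks by a factor of (gamma + 1)
    return (gamma * ssthresh + cwnd) / (gamma + 1)


def p_start_trajectory(ssthresh, gamma=1, delta=2, cwnd=1):
    """cwnd per RTT from the initial window until ssthresh is reached."""
    windows = [cwnd]
    while cwnd != ssthresh:
        cwnd = p_start_step(cwnd, ssthresh, gamma, delta)
        windows.append(cwnd)
    return windows


if __name__ == "__main__":
    print(p_start_trajectory(64, gamma=1, delta=2))
    print(p_start_trajectory(100, gamma=3, delta=2))
```

With γ = 1 the trajectory reduces to the basic rule of section 2.2; with γ = 3 each approach step adds three quarters of the remaining gap, so fewer RTTs are needed.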

The performance of the above parameterized model is determined by the static
setting of the parameters γ and δ, whereas P-Start should be more adaptive to the dynamics of
the Internet. Since a large initial window has been proposed to improve the performance of
slow start, with the same effect as the factor δ in P-Start, the value of δ is decided
by the large initial window option. For performance reasons, the startup phase cannot
last too long; the time taken by various slow start mechanisms is shown in table 1.




Table 1. Duration of startup phase (RTT is fixed and all packets are successfully acked)
Generally, the duration is no longer than 10 RTTs. P-Start takes time
= 2RTT, and ssthresh can be obtained by the method of J. Hoe [8]. If

standard slow start takes less than 5 RTTs, namely $\log_2 ssthresh < 5$, P-Start sets γ to
1; else if $5 \le \log_2 ssthresh < 10$, P-Start sets γ to $\alpha \times \log_2 ssthresh / 5$, $\alpha \ge 1$; else γ follows equation
(4). The adaptivity of P-Start needs further research.



t (ssthresh / 2-S) 



<5 



( 4 ) 
3 Network Simulations and Validation

To validate P-Start, we implement it in the network simulator NS2 [13] on
Red Hat Linux 9.0. The network topology is shown in figure 5, and the simulation scenarios
include various mechanisms, such as TCP standard slow start, TCP Vegas and
P-Start with different parameter settings. We design a series of experiments and compare
the simulation results to identify the advantages of P-Start.




Fig. 5. Topology of network simulation 



As shown in figure 5, N connections share a bottleneck link formed by routers
R1 and R2; the bandwidth of the bottleneck link is 1.5 Mbps. The times labeled in figure 5 are
one-way delays, the packet size is 1024 Bytes, the buffer size L is 25 packets, and the
data transferred in the network is FTP traffic.

In experiment one, 0.5 Mbps of background traffic is injected into the link, and the
segment-discarding strategy of the bottleneck link is Droptail. The TCP connection we measure
has a fixed share of 1 Mbps and 8 buffer units. The bandwidth-delay product (BDP) of the
network can be computed as follows:

BDP = bandwidth × RTT = 1.0 Mbps × 2 × (2 + 45 + 3) ms = 12500 Bytes

The BDP is approximately 12 packets and the available buffer size is 8 packets,
so the pipe's capacity is 20 packets. The size of the transferred file is 60 Kbytes. In the TCP
protocol, the window of the sender is the smaller of cwnd and the receiver's
advertised window rwnd. rwnd is set to 18, 20, 24, 36 and 64 packets, values that are
less than, equal to and greater than the actual capacity of the network. The relation between packet
losses and window size for the various slow start mechanisms is shown in figure 6.
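
The pipe-capacity arithmetic above can be reproduced directly. This is a small worked example using the figures quoted in the text (1 Mbps share, one-way delays of 2, 45 and 3 ms, 1024-byte packets, 8 buffer units); the variable names are chosen here for illustration only.

```python
# Values taken from the experiment description
bandwidth_bps = 1_000_000          # fixed share of the measured connection, bits per second
one_way_delays_ms = [2, 45, 3]     # one-way delays labeled in the topology
packet_size_bytes = 1024
buffer_packets = 8                 # buffer units available to the connection

rtt_ms = 2 * sum(one_way_delays_ms)                 # round trip time in milliseconds
bdp_bits = bandwidth_bps * rtt_ms // 1000           # bandwidth x RTT, in bits
bdp_bytes = bdp_bits // 8
bdp_packets = bdp_bytes // packet_size_bytes        # whole packets that fit in flight
pipe_capacity = bdp_packets + buffer_packets        # in-flight packets plus buffered packets

print(rtt_ms, bdp_bytes, bdp_packets, pipe_capacity)
```

Integer arithmetic is used so that the result is exact; the rwnd settings of 18, 20 and 24 or more packets then fall below, at and above this capacity.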




Fig. 6. Packet losses vs. rwnd 
Table 2. Comparison of utilities

Time     Slow start   Vegas    P-Start (γ=1, δ=2)   P-Start (γ=3, δ=2)
2 Sec    48.6%        35.1%    42.7%                69.6%
4 Sec    74.3%        78.2%    68.9%                75.6%
6 Sec    84.1%        91.3%    77.2%                85.2%
8 Sec    87.5%        93.5%    85.4%                88.3%



Throughput during slow start is of great significance to the performance of services,
especially WEB services. Table 2 shows the utility of the slow start mechanisms. The simulation results
indicate that standard slow start performs better than TCP Vegas and
P-Start (γ=1, δ=2), and has worse utility than P-Start (γ=3, δ=2). P-Start gains better
utility, while TCP Vegas has better utility when TCP is in the congestion avoidance phase.
P-Start can achieve different utility under different parameter configurations.
More packets are dropped when P-Start is combined with a large initial window of 4, whereas
fewer packets are dropped and better performance is achieved when P-Start is combined
with J. Hoe's method. For reasons of space, the details are omitted.

In order to check the impact of slow start on network stability, we monitor the
bursty traffic and instantaneous queue length of the bottleneck link when the segment
dropping strategy is RED (Random Early Detection). Slow start and P-Start (γ=1, δ=2)
each start a new FTP connection every 0.5 second and end it 0.7 second later. The
instantaneous queue length of the bottleneck link is shown in figure 7. P-Start
effectively maintains the stability of the network because its relatively smoother
window change causes smaller oscillation.




Fig. 7. Comparison of instantaneous queue length 

Experiments show that P-Start can effectively decrease packet losses, lighten 
network oscillation, and improve performance of network with appropriate parameter 
setting, and the adaptive parameter setting requires further research. 

Fairness reflects the bandwidth allocation among the various connections on the bottleneck
link. We use Jain's Fairness Index, defined as follows [15]:

$$\text{Fairness Index} = \frac{\left(\sum_{i=1}^{n} v_i\right)^2}{n \sum_{i=1}^{n} v_i^2}$$

where $v_i$ is the throughput of the ith flow and n is the total number of flows. The
fairness index is a ratio that lies between 0 and 1. The upper bound of 1 indicates
that all flows share the same bandwidth of the bottleneck link. Simulations with different
numbers of flows are run to test the fairness of P-Start. We calculate the Fairness Index
for TCP Reno and P-Start in each simulation, and the average Fairness Index is
0.9945 and 0.9949 respectively. The simulation results show that P-Start only improves
the performance of flows at the start and has no impact on the other phases of a flow.
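
Jain's Fairness Index is straightforward to compute from per-flow throughputs. The sketch below is illustrative only; the function name `jain_fairness_index` and the sample throughput values are made up for the example and are not simulation results.

```python
def jain_fairness_index(throughputs):
    """Jain's fairness index: (sum v_i)^2 / (n * sum v_i^2)."""
    n = len(throughputs)
    sum_sq = sum(v * v for v in throughputs)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one flow with non-zero throughput")
    total = sum(throughputs)
    return total * total / (n * sum_sq)


if __name__ == "__main__":
    # Equal shares give the upper bound of 1
    print(jain_fairness_index([5, 5, 5, 5]))
    # One flow taking everything gives 1/n
    print(jain_fairness_index([1, 0, 0, 0]))
```

When one flow takes all the bandwidth the index drops to 1/n, so values close to 1 indicate that the flows receive nearly equal shares.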



4 Conclusion 

This paper analyses the impact of TCP slow start on network transmission and
the difficulties of deploying an effective and stable slow start, and proposes a new
phase-divided slow start mechanism. This mechanism adjusts the rule of congestion window
change, transitions quickly and smoothly from slow start to congestion avoidance, and
decreases the damage caused by multiple packet losses. It has no effect on TCP
congestion avoidance but benefits short-lived flows and flows in long fat pipes.
Further research will apply effective and accurate bandwidth measurement technology
to dynamically set a proper slow start threshold, which overcomes the limitation
of static parameter configuration and enhances adaptivity to improve network utility.
Abstract. As an important metric for wireless ad hoc networks, network
lifetime has received great attention in recent years. Existing
lifetime-aware algorithms implicitly assume that nodes are
cooperative and truthful, and they cannot work properly when the network
contains selfish nodes. To make these algorithms achieve their design
objectives even in the presence of selfish nodes, in this paper we
propose a truthful mechanism, Second-Max-Min (SMM), based on the
analysis of current algorithms, as well as a DSR-like routing protocol for
implementing the mechanism. In the SMM mechanism, the source node
gives appropriate payments to relay nodes, and the payments are related
to the path which has the second maximum lifetime among all possible paths.
We show that the payment ratio is relatively small due to the nature of
lifetime-aware routing algorithms, which is confirmed by experiments.



1 Introduction 

Power-aware routing is a key concern for wireless ad hoc networks due to the
limited battery power of nodes. Current research on power-aware routing mainly
focuses on two aspects: minimizing the energy consumed by communication (i.e.
energy-efficiency) [8] and maximizing the lifetime of the whole network (i.e.
lifetime-awareness) [4-6]. An energy-efficient routing protocol tries to find the path which
consumes the minimal energy. However, the nodes on the minimal energy path
will be drained of energy quickly if all the packets are routed along this path.
Therefore, it would be better to route along nodes which have a higher residual
energy, which is the approach taken by lifetime-aware routing protocols.

Most previous works on power-aware routing have implicitly assumed that
nodes are cooperative and truthful. A cooperative node is willing
to relay packets for other nodes. A truthful node will reveal its
private information, such as its residual energy. However, this assumption
cannot be taken for granted from the view of an individual node. A node may
tend to be selfish, refuse to relay packets for other nodes, or not tell the truth
for its own benefit.

Several protocols have been proposed to stimulate the cooperation of nodes
(see [9] for a survey). Further, [2] proposed Ad hoc-VCG, a reactive routing
protocol that copes with selfish nodes while also achieving the desirable goals
of truthfulness and energy-efficiency. However, as we have pointed out, lifetime
is also an important metric for the network. In the face of selfish nodes, how to make
existing lifetime-aware routing algorithms achieve their design objectives is an
imperative problem to be solved. But to the best of our knowledge, few works
have addressed this problem.

In this paper, we study existing lifetime-aware routing algorithms and propose
a truthful mechanism, SMM, based on the analysis of current algorithms. Our
mechanism deals with selfish nodes within the framework of algorithmic mechanism
design [3]. By giving appropriate payments to relay nodes, the mechanism
ensures that existing algorithms work properly even if the nodes in the network are selfish.
We also present a DSR-like routing protocol to implement the SMM mechanism.

The rest of the paper is organized as follows. Section 2 reviews some related
works. Section 3 presents our problem and analyzes existing solutions. Section
4 proposes the SMM mechanism and presents a DSR-like protocol for the
implementation of our mechanism. Section 5 proves the truthfulness of the SMM
mechanism. Section 6 conducts experiments to examine the performance in terms of payment
ratio. We conclude our work in Section 7.

2 Related Work 

Several lifetime-aware routing algorithms which do not consider the effect of
selfish nodes have been proposed. MMBCR [4] tries to avoid the path containing the node
with the least battery power among all nodes in all possible paths. MRPC [5]
identifies the capacity of a node not only by its residual battery power, but also
by the expected energy spent in reliably forwarding a packet over a specific link.
It selects the path which has the largest packet capacity at the critical node. LPR
[6] minimizes the variance in the remaining power of all the nodes and thereby
prolongs the network lifetime. Other algorithms like CMMBCR [4] and CMRPC
[5] can be viewed as conditional variants of the algorithms mentioned above.

To make existing algorithms continue to work when the nodes in the network are
selfish, we adopt the framework of algorithmic mechanism design. Algorithmic
mechanism design considers problems in a distributed environment where
the participants cannot be assumed to follow the algorithm but rather their own
self-interest. [3] proposed a formal model for such problems. It can be described
as follows:

In a distributed environment, there are n agents. Each agent i has some private
information $t^i$, called its type. For a mechanism design problem, there is
an output specification that maps each type vector to a set of allowed outputs
$o \in O$. Agent i's preferences are given by a valuation function $v(o, t^i)$. A mechanism
defines a family of strategies $A^i$ for each agent i. For each strategy vector
$(a^1, \ldots, a^n)$, $a^i \in A^i$, the mechanism computes an output $o = o(a^1, \ldots, a^n)$, and
a payment vector $p = (p^1, \ldots, p^n)$. Agent i's utility is $u^i = p^i - v(o, t^i)$. It is i's
goal to maximize its utility. A mechanism is called truthful if for every agent i of
type $t^i$ and for every strategy $a^{-i}$ of the other agents, i's utility is maximized
when it declares its type $t^i$.
Several standard problems have been studied as mechanism design problems
[3]. In the context of wireless ad hoc networks, [2] applies mechanism
design theory to the ad hoc energy-efficient routing problem. They proposed a
reactive routing protocol, Ad hoc-VCG, which achieves the design objectives of
truthfulness and energy-efficiency by paying the intermediate nodes a
premium over their actual costs for forwarding packets.



3 Problem Statement and Analysis 

In this section, we present the mechanism design problem for lifetime-aware
routing in wireless ad hoc networks with selfish nodes.

In a wireless ad hoc network, there are m mobile nodes, each of which has
a unique identification and belongs to a different user. Each node
is selfish but economically rational, and its objective is to maximize its own
benefit. A rational node is willing to forward packets for others
only when it can get payments equal to or greater than what it desires. Now, a
source node S wants to send a message to a destination node D, and n
possible paths can be found between S and D. Our problem is to select a path
from these n possible paths so as to maximize the lifetime of the network and ensure the
truthfulness of the selected path.

Section 2 has reviewed some solutions that do not consider the impact of
selfish nodes. After examining existing algorithms, we find that they can
be represented in a common form as follows:

Let function g() be the common representation of the lifetime of all nodes.
Factors such as the residual battery power of a node and the transmission power
between nodes can be used as the parameters of g(). We treat the minimal
lifetime of the nodes in a path as the lifetime of the path, and select the path which
has the maximal lifetime as the output path, i.e. the output path o can be
obtained from the equation:

$$o = \max_{j \in A} \left( \min_{i \in j} \left( g(R_j^i, \ldots) \right) \right), \qquad (1)$$

where $R_j^i$ is the residual battery power of node i in path j, and A is the set of
all possible paths.
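
The max-min selection in (1) can be sketched as follows. This is an illustration only: it assumes the lifetime g() of each node has already been evaluated and stored in a dictionary, and the names `path_lifetime` and `select_path` as well as the sample node lifetimes are hypothetical.

```python
def path_lifetime(path, lifetime):
    """Lifetime of a path = minimal lifetime among the nodes on it."""
    return min(lifetime[node] for node in path)


def select_path(paths, lifetime):
    """Return the path whose minimal node lifetime is maximal (max-min rule)."""
    return max(paths, key=lambda path: path_lifetime(path, lifetime))


if __name__ == "__main__":
    # Hypothetical per-node lifetimes g(...) for relay nodes a..e
    lifetime = {"a": 5, "b": 9, "c": 7, "d": 3, "e": 8}
    # Three candidate relay paths between S and D (endpoints omitted)
    paths = [("a", "b"), ("c",), ("d", "e")]
    best = select_path(paths, lifetime)
    print(best, path_lifetime(best, lifetime))
```

In this example the path lifetimes are 5, 7 and 3, so the single-relay path through node c is selected even though node b has the largest individual lifetime.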

As we have pointed out, a node may tend to be selfish. To prevent its battery
power from being consumed for other nodes, a node may refuse to relay packets, or declare
a very low lifetime so that it cannot be selected as a relay node. In this case,
existing algorithms will fail to work. Our objective is to design a mechanism
which ensures that packets are routed along the path selected by (1) and that this path is truthful.

4 SMM Mechanism and Protocol 

In this section, we propose SMM mechanism to cope with the selfish nodes. We 
also present a DSR-like protocol to implement our mechanism. 
4.1 SMM Mechanism

To design a mechanism, we provide the output function of the mechanism, define
a practical valuation function for the nodes, and present an appropriate payment function.

The output function o() is given first. According to the lifetime declaration of
each node (a node can declare its lifetime at will), the output function o() selects
a path from all possible paths by using (1). It can be represented as follows:

$$o(a^1, \ldots, a^m) = \max_{j \in A} \left( \min_{i \in j} \left( a_j^i \right) \right), \qquad (2)$$

where $a^i$ is the lifetime declaration of node i, and $a_j^i$ is the lifetime declaration of
node i in path j.

The valuation function of nodes is defined as follows:

$$v(o, t^i) = \begin{cases} 0 & : i \notin o \\ \dfrac{1}{t^i} & : i \in o \end{cases}, \qquad (3)$$

where $t^i$ is the lifetime of node i, and o is the output path.

It means that i's valuation is zero if i does not belong to o, and i's valuation
is inversely proportional to its lifetime if i is one of the nodes in o. Intuitively, the
shorter the lifetime of i, the less willing i is to forward packets for
other nodes. Therefore, i would expect more payment.

The goal of a node is to maximize its utility, so it tends to choose a favorable
strategy and become untruthful. If a node declares a false lifetime, o() may select
an improper path. We have to design an appropriate payment function which
can meet the needs of the nodes while remaining compatible with the output of the algorithm.

We treat the minimal lifetime declaration of the nodes in path $j$ as $j$'s lifetime
declaration. Assume that the nodes whose lifetime declarations are minimal in the
$n$ possible paths are $q_1, \ldots, q_n$; then the lifetime declarations of these $n$ paths
are $a_{q_1}^1, \ldots, a_{q_n}^n$. We simply denote them as $a_1, \ldots, a_{n-1}, a_n$. Without
loss of generality, we assume $a_1 < \ldots < a_{n-1} < a_n$. The output is then
path $n$. The payment function p() is defined as follows:



$$p^i(a) = \begin{cases} 0, & i \notin n \\ \ldots, & i \in n \end{cases} \qquad (4)$$

where $c$ is a constant.

It means that the payments to nodes which do not belong to the output path 
would be zero, and the payments to nodes in the output path are related to the 
path which has the second maximum lifetime declaration in all possible paths. 

We call our mechanism the Second-Max-Min (SMM) mechanism and will 
prove the truthfulness of SMM mechanism in the next section. 
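
The selection step of the mechanism can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function name `select_path` and the dictionary layout are assumptions, and the payment formula (4) itself is deliberately not reproduced. The sketch only computes each path's lifetime declaration (the minimum over its nodes), the output path (the maximum of these minima), and the second-best path on which the payments depend.

```python
def select_path(declarations):
    """Apply the max-min output rule to declared lifetimes.

    declarations: dict mapping a path id to the list of lifetime
    declarations of the intermediate nodes on that path (hypothetical layout).
    Returns (output_path, second_path, path_lifetimes).
    """
    # A path's lifetime declaration is the minimum declaration of its nodes.
    path_lifetimes = {j: min(a) for j, a in declarations.items()}
    # Rank paths by their lifetime declaration, largest first.
    ranked = sorted(path_lifetimes, key=path_lifetimes.get, reverse=True)
    output_path = ranked[0]
    # The payments refer to the path with the second maximum declaration.
    second_path = ranked[1] if len(ranked) > 1 else None
    return output_path, second_path, path_lifetimes


declarations = {"A": [5, 9, 7], "B": [6, 8], "C": [3, 10]}
best, second, lifetimes = select_path(declarations)
print(best, second, lifetimes)
```

With these example declarations the path lifetimes are 5, 6 and 3, so path B is the output and path A is the reference for the payments.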



4.2 Protocol 

The DSR [7] protocol is a reactive routing protocol which is used to find the
shortest hop path between source and destination. To meet the needs of the
implementation of our SMM mechanism, we make some modifications to the DSR
protocol.

First, the lifetime declaration of each intermediate node, which is equal to
its type under the SMM mechanism, is recorded in the request packet. The type is
private information of each node. To prevent a node's type information from
being known or altered by other nodes, we adopt a PKI-based security model.
In this model, the keyed encryption algorithm is known to all the nodes in
the network, and the encryption and decryption keys are generated by S. When S
starts the route discovery phase, it puts the encryption key in the route request
packet. Every intermediate node uses the encryption key in the received route
request packet and the public encryption algorithm to encrypt its private lifetime
declaration. After receiving the route reply packet, S uses the decryption key to
decrypt the lifetime declaration of each intermediate node in the packet.

Second, instead of selecting the shortest hop path, we try to choose a path
which maximizes the lifetime of the network. When S wants to send a packet to
D, it starts a timer and launches the route discovery phase. During a period of
time T, S may receive several possible paths from the destination. Each path
carries the lifetime declarations of the nodes in the path. S chooses a path
from these paths by using the output function o() and calculates the payments
to each node in the selected path by using the payment function p().

Third, we avoid the route cache optimization techniques used in DSR. The
cached routes cannot represent the current state of the nodes because every
node's type keeps changing. In our implementation, the source node periodically
refreshes its cache and triggers a new route discovery process, and intermediate
nodes do not respond to route requests with cached routes.

Last, unlike DSR, a node processes the route request packet even if it has seen
the request before. We cannot simply discard the packet because a packet that
arrives later may carry a longer minimal lifetime declaration. Therefore, we do
not test whether the node has seen the request before. To prevent heavy traffic,
a node discards a packet if it has seen the same request more than a fixed
number of times.
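
The duplicate-handling rule can be sketched as a per-node counter. This is a minimal sketch under stated assumptions: the class name `RequestFilter`, the request identifier, and the limit `max_copies` are hypothetical, since the text only says "several times" and gives no concrete bound.

```python
class RequestFilter:
    """Per-node bookkeeping for repeated route requests (illustrative only)."""

    def __init__(self, max_copies=3):
        # max_copies is an assumed bound; the paper does not specify a value.
        self.max_copies = max_copies
        self.seen = {}  # request id -> number of copies seen so far

    def should_forward(self, request_id):
        """Return True if this copy of the request should still be processed."""
        count = self.seen.get(request_id, 0) + 1
        self.seen[request_id] = count
        # Unlike plain DSR, repeated copies are processed until the limit is hit.
        return count <= self.max_copies


node_filter = RequestFilter(max_copies=2)
print([node_filter.should_forward("req-1") for _ in range(3)])
```

Raising the limit lets more candidate paths reach the source at the cost of more control traffic.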

5 SMM Mechanism Analysis 

In this section, we show that the design of the SMM mechanism ensures that the
routing algorithm obtains its desired output in the presence of selfish nodes.

First, our mechanism guarantees the voluntary participation of all nodes. If
a node can get payments equal to or greater than its valuation, it is willing to
participate in the protocol. It can be shown that, whether or not a node belongs
to the output path, its utility is non-negative in our mechanism.

Second, our mechanism is truthful. It is clear that if all nodes declare their
true types ($a^i = t^i$), our mechanism guarantees that the source node chooses the
path desired by the algorithm. Here, we prove that the SMM mechanism is truthful.

Theorem 1. SMM mechanism is truthful. 

Proof. To prove that our mechanism is truthful, we need to show that no node
can obtain more utility than it gets when declaring its true lifetime; that
is, cheating cannot increase the utility of a node. We proceed in two steps.

First, the lifetime declaration of each path is truthful. We treat the nodes in
a path as an entity, and consider the behavior of the path. Assume that there
are $n$ possible paths between S and D, and that the lifetimes of these $n$ paths are
$a_1, \ldots, a_{n-1}, a_n$ ($a_1 < \ldots < a_{n-1} < a_n$).

- Path $n$ is selected as the output path if it declares its true lifetime, and
its utility is $u = \ldots - \sum_{k=1}^{l_n} \ldots \ge 0$, where $l_n$ is the number of nodes
other than S and D in path $n$. Now suppose path $n$ declares a false lifetime $a'_n$.
If $a'_n > a_{n-1}$, $n$ is still selected as the output path, and its utility does
not change. If $a'_n < a_{n-1}$, $n$ is not selected as the output path, and its
utility is zero.

- The other paths are not selected as the output path if they declare their
true lifetimes, and their utility is zero. Now suppose a path $j$ declares a false
lifetime $a'_j$. If $a'_j < a_n$, $j$ is still not selected as the output path, and
its utility does not change. If $a'_j > a_n$, $j$ is selected as the output path and
its utility is $u' = \ldots - \sum_{k=1}^{l_j} \ldots$, but its expected utility $u$ is larger
than $u'$ because $u > \ldots > u'$. Therefore, there must exist some nodes in
path $j$, such as the node with the minimal lifetime, whose utility decreases.

Second, the lifetime declaration of each node in a path is truthful. We consider
a node $i$ in path $j$. Whether or not $i$ is the node with the minimal lifetime in
path $j$, $i$ cannot get more utility by declaring a false lifetime. The analysis
is similar to the first step, so we omit it here.

To measure the payment, we define the payment ratio: let path $j$ be the output
path; the payment ratio is the ratio of the payment for path $j$ to the valuation
of path $j$. We have the following theorem for the payment ratio.

Theorem 2. For the SMM mechanism, let path $j$ be the output path, which has
the maximum lifetime among all possible paths from S to D, and let path $s$ have
the second maximum lifetime. Let $\mathrm{Max}(a^j)$ denote the maximal lifetime declaration
of nodes in path $j$, $\mathrm{Min}(a^j)$ the minimal lifetime declaration of nodes in
path $j$, and $\mathrm{Min}(a^s)$ the minimal lifetime declaration of nodes in path $s$;
then:

$$\frac{\mathrm{Min}(a^j)}{\mathrm{Min}(a^s)} \le \beta \le \frac{\mathrm{Max}(a^j)}{\mathrm{Min}(a^s)}. \qquad (5)$$

Proof. We omit the proof due to limitations of space.

The payment ratio $\beta$ can be used as an important metric of the performance
of a mechanism. If $\beta$ is close to 1, the premium that the source node pays to
intermediate nodes is low. This means that the mechanism achieves the design
objective of the algorithm at little additional cost. If $\beta$ is far greater than 1,
the premium that the source node pays to intermediate nodes is high. This means
that the mechanism achieves the design objective of the algorithm at high
additional cost. The essence of a lifetime-aware routing algorithm is to distribute
the power consumption evenly among nodes, so that the lifetimes of the nodes tend
to approach each other. From Theorem 2, we can conclude that $\beta$ is close to 1
when the maximal lifetime of the nodes in the output path is close to the minimal
lifetime of the nodes in the path which has the second maximum lifetime.
Therefore, we can infer that the SMM mechanism has an excellent payment ratio,
which is relatively small and stable.
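
A small numeric check makes the behavior of the payment ratio concrete. The sketch below rests on assumptions that are not confirmed by the text: it takes each node on the output path to be paid $c$ divided by the lifetime declaration of the second-best path, and each node's valuation to be $c$ divided by its own lifetime, which is one reading consistent with "inversely proportional to its lifetime". The function names are hypothetical. Under this reading the ratio lies between the minimal and the maximal lifetime on the output path, each divided by the lifetime of the second-best path.

```python
def payment_ratio(output_path_lifetimes, second_path_min, c=1.0):
    """Payment ratio under the ASSUMED forms payment = c / second_path_min
    per node and valuation = c / t per node (not taken from the paper)."""
    payment = len(output_path_lifetimes) * c / second_path_min
    valuation = sum(c / t for t in output_path_lifetimes)
    return payment / valuation


def ratio_bounds(output_path_lifetimes, second_path_min):
    """Lower and upper bounds on the ratio implied by the assumed forms."""
    return (min(output_path_lifetimes) / second_path_min,
            max(output_path_lifetimes) / second_path_min)


lifetimes = [10, 12, 15]   # true lifetimes of nodes on the output path
second_min = 8             # lifetime declaration of the second-best path
beta = payment_ratio(lifetimes, second_min)
low, high = ratio_bounds(lifetimes, second_min)
print(low, beta, high)
```

As the lifetimes on the output path and the second-best path move closer together, both bounds and therefore the ratio approach 1.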

6 Experiment 

We conducted experiments to evaluate the payment ratio of the SMM mechanism.
The simulation consisted of a network of 50 nodes randomly distributed over a
700 × 700 m² area. We used CBR traffic at 4 packets per second, and the
packet size was 512 bytes. Random connections were established. The source
node refreshed its cache every 10 s. Each node was given enough battery
power to finish the experiments. The initial values of battery power in all nodes
were the same. A node could dynamically adjust its transmission power based on
the link distance $d$, and the transmission cost $h$ is $K \cdot d^{\alpha}$, where $\alpha$ is the
signal loss exponent. Two lifetime-aware routing algorithms, MMBCR and MRPC, were
implemented. We tried to find the influence of different parameters (such as the
link distance $d$ and the signal loss exponent $\alpha$) on the payment ratio.

In Fig. 1, we present the payment ratio in MMBCR when $\alpha = 2$. It can be
observed that the payment ratio is very small and close to 1. We compare the
situations in which the maximum transmission range $R$ of nodes is 150 m, 200 m
and 250 m respectively. Increasing the transmission range has two effects:
(1) Each node covers more nodes, so there are more possible paths
between the source node and the destination node. This improves the balance
of traffic on nodes and reduces the lifetime variance between nodes. (2) The
range of transmission costs increases due to $h = K \cdot d^{\alpha}$, which increases the
lifetime variance between nodes. In Fig. 1, we can see that the payment ratio
for $R$ = 150 m is higher than the payment ratio for $R$ = 200 m and $R$ = 250 m.
This result can be viewed as a consequence of the first effect. The payment ratio
for $R$ = 200 m is close to the payment ratio for $R$ = 250 m, which can be viewed
as the balance between the two effects.

In Fig. 2, we present the payment ratio in MRPC when $R$ = 150 m. In MRPC,
the lifetime of a node is the ratio of its residual battery power to its transmission
cost (we do not consider the link's packet error probability). Though the initial
battery power of all nodes is the same, the transmission cost of the nodes differs
because the transmission cost depends on the link distance. This experiment can
therefore be viewed as a simulation in which the initial lifetimes of the nodes
differ. From Fig. 2, we see that the payment ratio increases with $\alpha$. This is
because the higher $\alpha$ is, the larger the difference between the initial lifetimes
of the nodes. As we have pointed out, a lifetime-aware routing algorithm tries to
minimize the variance of lifetime between nodes to increase the lifetime of the
network. It can be observed in Fig. 2 that the payment ratio decreases with time.

Fig. 1. $\beta$ vs. $R$ in MMBCR

Fig. 2. $\beta$ vs. $\alpha$ in MRPC



7 Conclusion 

In this paper, we dealt with the problem of maximum lifetime routing in ad hoc
networks with selfish nodes. By applying the framework of algorithmic mechanism
design, we designed the SMM mechanism. The basic idea of our mechanism is to give
appropriate payments to stimulate the cooperation of nodes, so that cheating
cannot increase, and may even reduce, a node's utility. In the SMM mechanism, the
payments to nodes in the output path are related to the path which has the second
maximum lifetime among all possible paths. We proved that the SMM mechanism is
truthful, and proposed a routing protocol to implement it. Finally, we
conducted experiments to evaluate the performance of our mechanism.

Abstract. Continuous Event Graphs (CEGs), a subclass of Continuous Petri
Nets, are defined as the limiting cases of Timed Event Graphs and Timed Event
Multigraphs. A set of linear equations over a dioid is derived as a novel
method of analyzing a special class of CEGs. The derivation treats the cumulated
tokens consumed by transitions as state variables, endows the monotone
nondecreasing functions with the pointwise minimum as addition, and endows the
lower-semicontinuous mappings from the collection of monotone
nondecreasing functions to itself with the pointwise minimum as addition and
the composition of mappings as multiplication. As a new modeling approach, it
clearly illustrates the characteristics of continuous events. Based on the
algebraic model, an example of optimal control is demonstrated.



1 Introduction 

It is well known that max-plus algebra is a powerful tool for Discrete Event
Dynamic Systems (DEDS) [1-3]. Min-plus algebra is the dual of max-plus algebra,
and both are dioids, that is, idempotent semirings.

The linear model is so popular that it is adopted by most of classical control
theory. Much evidence supports that max-plus algebra, min-plus algebra, dioids
and some other idempotent semirings are appropriate mathematical tools to
describe the phenomena of DEDSs, especially synchronization. With the operations
of these algebraic systems, some logically nonlinear formulae become linear. As a
by-product of the linearization, the existence of an eventual periodic regime is
readily obtained, the performance being characterized in terms of invariants of
the original net [3].

An event graph is a Petri net such that each place has only one input arc and one
output arc. Timed Event Graphs (TEGs) are a subclass of event graphs in which
the tokens have to stay a minimum amount of time in the places. The dynamics of
TEGs can be represented as a max-plus linear algebraic model [1]. Timed Event
Multigraphs (TEMGs), in which each arc has an integral weight, are extensions of
TEGs. Max-plus linear algebraic models of TEMGs were built by G. Cohen [2] and
HuaPing Dai [3] in different ways. With those linear formulae, the analysis of
TEGs and TEMGs is analogous to traditional linear system theory.

Compared with TEGs or TEMGs, Continuous Event Graphs (CEGs), proposed by
R. David and H. Alla [4-6], have continuous places and continuous transitions. As a
limiting case of TEMGs, CEGs are applied to describe continuous event systems
[7-11], for example the flow control and congestion control of TCP [12].
Unfortunately, building an algebraic model for CEGs is more difficult than for
TEMGs. G. Cohen [2] derived a min-plus algebraic model that considers only the
case of continuous places, but not the case of continuous transitions. No
universal algebraic model has been constructed for CEGs so far.

In Section 2, the definition of CEG is reviewed. Some properties of CEGs are
presented in Section 3. A dioid linear algebraic representation of a class of CEGs
is given in Section 4, and an example is tackled in Section 5 with the new model.



2 Definition of CEG 

Let $\mathbb{R}$ be the set of reals, $\omega = +\infty$, $\bar{\mathbb{R}} = \mathbb{R} \cup \{\omega\}$ and $\bar{\mathbb{R}}^+ = [0, +\infty) \cup \{\omega\}$.

Definition 1. A CEG is a 6-tuple $\langle P, T, R, W, V, M_0 \rangle$: $P$ is a set of continuous
places; $T$ is a set of continuous transitions; $R = R_{PT} \cup R_{TP}$, where $R_{PT} \subseteq P \times T$ and
$R_{TP} \subseteq T \times P$, and $\langle n_1, n_2 \rangle \in R$ represents that $n_1$ is an input of $n_2$ and $n_2$ is an
output of $n_1$; $W : R \to \bar{\mathbb{R}}^+$ is the weight of $R$; $V : T \to \bar{\mathbb{R}}^+$ gives the maximum firing
velocities of the transitions; $M_0 : P \to \bar{\mathbb{R}}^+$ gives the initial markings of the places.
Every place has only one input transition and one output transition; moreover, we
assume that every transition has at least one input place and one output place.

CEGs can be illustrated by directed graphs (Fig. 1) that denote continuous places
as double circles, continuous transitions as rectangles, and the elements
of $R$ as arrows.

Definition 2. Let $t$ be a transition; ${}^{\circ}t = \{p \in P \mid \langle p, t \rangle \in R_{PT}\}$ is called the preset of
$t$, and $t^{\circ} = \{p \in P \mid \langle t, p \rangle \in R_{TP}\}$ is called the postset of $t$. Let $p$ be a place;
${}^{\circ}p = \{t \in T \mid \langle t, p \rangle \in R_{TP}\}$ is called the preset of $p$, and $p^{\circ} = \{t \in T \mid \langle p, t \rangle \in R_{PT}\}$ is
called the postset of $p$.

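Definition 3. Let $\mathrm{Mark}(p, \tau)$ denote the token of place $p$ at the epoch $\tau$, and let $V(t, \tau)$
denote the firing velocity of transition $t$; then:

1) $V(t, 0) = 0$;

2) if $\forall p \in {}^{\circ}t$, $\mathrm{Mark}(p, \tau) > 0$, then $V(t, \tau) = V(t)$;

3) if $\exists p \in {}^{\circ}t$, $\mathrm{Mark}(p, \tau) = 0$, denote $T_0 = \{p \mid p \in {}^{\circ}t \text{ and } \mathrm{Mark}(p, \tau) = 0\}$;
then

$$V(t, \tau) = \min\Big\{ \min_{p \in T_0} \{ V({}^{\circ}p, \tau) \times W({}^{\circ}p, p) / W(p, t) \},\ V(t) \Big\}.$$

Note that if there is a loop such that $\mathrm{Mark}(p, \tau) = 0$ for every place $p$ in the
loop, then $V(t, \tau) = 0$ for every transition $t$ in the loop.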
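
Cases 2) and 3) of Definition 3 can be sketched as a small function. This is an illustration under assumptions: the function name and the tuple layout describing each input place are hypothetical, the upstream velocities are taken as already known, and the initial condition 1) and the empty-loop remark are not handled.

```python
def firing_velocity(v_max, input_places):
    """Instantaneous firing velocity of a transition t.

    v_max: maximum firing velocity V(t).
    input_places: list of tuples (mark, v_up, w_up, w_down), one per input
    place p of t, where mark = Mark(p, tau), v_up = V(pre(p), tau),
    w_up = W(pre(p), p) and w_down = W(p, t). (Layout is an assumption.)
    """
    # Feeding rates through the input places that are currently empty.
    limited = [v_up * w_up / w_down
               for mark, v_up, w_up, w_down in input_places
               if mark == 0]
    if not limited:
        # Case 2: every input place is marked, so t fires at full speed.
        return v_max
    # Case 3: t is throttled by the slowest empty input place.
    return min(min(limited), v_max)


print(firing_velocity(2.0, [(1.0, 5.0, 1.0, 1.0)]))
print(firing_velocity(2.0, [(0.0, 1.0, 1.0, 2.0), (3.0, 9.0, 1.0, 1.0)]))
```

In the second call only the first input place is empty, so the velocity is limited by the rate at which its upstream transition refills it.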

Remark 1. The firing of transition $t$ changes the token of some places:

1) if $p \in t^{\circ}$, then $\int_{\tau_1}^{\tau_2} W(t, p)\, V(t, \tau)\, d\tau$ represents the mark that $p$ receives from $t$ in
the time interval $[\tau_1, \tau_2]$;

2) if $p \in {}^{\circ}t$, then $-\int_{\tau_1}^{\tau_2} W(p, t)\, V(t, \tau)\, d\tau$ represents the (negative) change in the mark
of $p$ caused by $t$ in the time interval $[\tau_1, \tau_2]$.



3 Properties of CEGs 



Fig. 2. A continuous transition



Definition 4. $M_x : [0, +\infty) \to \bar{\mathbb{R}}^+$ is a monotone nondecreasing function:

1) if $t \in T$, $M_t(\tau)$ is the total mark that $t$ has consumed from the epoch 0 to $\tau$, namely

$$M_t(\tau) = \int_0^{\tau} V(t, u)\, du;$$

2) if $p \in P$, $M_p(\tau)$ is the total mark that has entered $p$ from the epoch 0 to $\tau$, namely

$$M_p(\tau) = M_0(p) + \int_0^{\tau} W({}^{\circ}p, p)\, V({}^{\circ}p, u)\, du.$$

We then consider a transition $t$ with ${}^{\circ}t = \{p_1, p_2, \ldots, p_n\}$ and $t^{\circ} = \{q_1, q_2, \ldots, q_m\}$, as shown in
Fig. 2.

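Proposition 1. For the input places of $t$,

$$M_{p_i}(\tau) = \mathrm{Mark}(p_i, \tau) + M_t(\tau)\, W(p_i, t), \quad i = 1, 2, \ldots, n; \qquad (1)$$

for the output places of $t$,

$$M_{q_i}(\tau) = M_{q_i}(0) + M_t(\tau)\, W(t, q_i), \quad i = 1, 2, \ldots, m. \qquad (2)$$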



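Proposition 2. Let $t \in T$; there exists $\Delta\tau > 0$ satisfying

$$M_t(\tau + \Delta\tau) = M_t(\tau) + \min\{ V(t) \cdot \Delta\tau,\ M_{p_i}(\tau + \Delta\tau) - M_t(\tau) \cdot W(p_i, t),\ i = 1, \ldots, n \}. \qquad (3)$$

Proof. Assume that $\mathrm{Mark}(p_i, \tau) > 0$, $i = 1, 2, \ldots, n$. Let

$$0 < \Delta\tau < \min_i \Big\{ \frac{\mathrm{Mark}(p_i, \tau)}{W(p_i, t)\, V(t)} \Big\}, \qquad (4)$$

then $V(t, \tau) = V(t)$ holds in the time interval $[\tau, \tau + \Delta\tau]$, and

$$M_{p_i}(\tau + \Delta\tau) - M_t(\tau) \cdot W(p_i, t) \ge M_{p_i}(\tau) - M_t(\tau) \cdot W(p_i, t) > V(t) \cdot \Delta\tau, \qquad (5)$$

thus (3) is satisfied.

Now consider the case that $\exists p \in {}^{\circ}t$, $\mathrm{Mark}(p, \tau) = 0$:

1) if there exists a $\Delta\tau > 0$ such that $\min_i \{\mathrm{Mark}(p_i, \tau_1)\} = 0$ holds for all
$\tau < \tau_1 < \tau + \Delta\tau$, then Definition 3 implies (3);

2) otherwise, there exists no $\Delta\tau > 0$ such that $\min_i \{\mathrm{Mark}(p_i, \tau_1)\} = 0$ holds for all
$\tau < \tau_1 < \tau + \Delta\tau$. As the limiting case of $\mathrm{Mark}(p_i, \tau) > 0$, $i = 1, 2, \ldots, n$, equation
(3) still holds in this case. □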

It is impossible to represent the equations in Proposition 2 in max-plus
linear form, since they contain multiplication operations, which correspond to
exponentiation under max-plus algebra.



4 Dioid Representation of a Class of CEGs 

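$M$ is a collection of functions: $\forall \lambda \in M$, $\lambda : \mathbb{R} \to \bar{\mathbb{R}}$, and $\lambda$ is monotone
nondecreasing. A minimization operation $\oplus$ in $\bar{\mathbb{R}}$ is defined by

$$\forall r_1, r_2 \in \bar{\mathbb{R}}: \quad r_1 \oplus r_2 = \min(r_1, r_2). \qquad (6)$$

Define the operation $\oplus$ in $M$ as pointwise minimization:

$$\lambda_1(\tau) \oplus \lambda_2(\tau) = \min(\lambda_1(\tau), \lambda_2(\tau)). \qquad (7)$$

Obviously, $\forall \lambda \in M$, $0 \oplus \lambda(\tau) = \lambda(\tau)$ and $0 \in M$.

Proposition 3. $(M, \oplus)$ is a monoid, the identity element is $0$, and $\oplus$ is commutative and
idempotent.

Proposition 4. $\forall \{\lambda_i\} \subseteq M$, $\bigoplus_i \lambda_i(\tau) \in M$; namely, $M$ is complete.

A partial order $\preceq$ is induced from $\oplus$ as follows:

$$\forall \lambda_1, \lambda_2 \in M: \quad \lambda_1(\tau) \preceq \lambda_2(\tau) \iff \lambda_1(\tau) \oplus \lambda_2(\tau) = \lambda_2(\tau). \qquad (8)$$

Definition 5. A mapping $f : M \to M$ is said to be lower-semicontinuous (l.s.c.) [3] if

$$\forall \{\lambda_i\} \subseteq M: \quad \bigoplus_i f(\lambda_i(\tau)) = f\Big(\bigoplus_i \lambda_i(\tau)\Big). \qquad (9)$$

Proposition 5. The set of all l.s.c. mappings on $M$, denoted $F$, forms a dioid with
two operations, $\oplus$ and $\otimes$, defined as follows:

$$\forall f_1, f_2 \in F,\ \forall \lambda \in M: \quad (f_1 \oplus f_2)(\lambda) = f_1(\lambda) \oplus f_2(\lambda), \qquad (f_1 \otimes f_2)(\lambda) = f_1(f_2(\lambda)). \qquad (10)$$

According to Proposition 4 and Proposition 5, the following holds.

Corollary 1. $F$ is complete.

For brevity, $\otimes$ may be omitted.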

Proposition 6. Let $k \in \mathbb{N}$ and $k > 0$, let $x \in M^k$ be a variable vector, $v \in M^k$ a
constant vector, and $A \in F^{k \times k}$. The least solution of the equation $x = Ax \oplus v$ is
$x = A^{*}v$, where $A^{*} = \bigoplus_{i=0}^{\infty} A^{i}$ [1].
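
The role of the star operator can be illustrated in a simpler setting than the one used in the text. The sketch below is an assumption-laden simplification: it works with scalar min-plus matrices (entries are numbers, $\oplus$ is min, $\otimes$ is ordinary addition) rather than with matrices of l.s.c. mappings, and it assumes nonnegative entries so that the star is reached after finitely many powers. All function names are hypothetical.

```python
import math

INF = math.inf  # null element of scalar min-plus: min(INF, a) == a


def mp_matmul(a, b):
    """Min-plus product of two matrices given as lists of rows."""
    return [[min(a[i][k] + b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def mp_star(a):
    """A* = I (+) A (+) A^2 (+) ...; with nonnegative entries the series
    is stationary after n - 1 powers for an n x n matrix."""
    n = len(a)
    identity = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    star, power = identity, identity
    for _ in range(n - 1):
        power = mp_matmul(power, a)
        star = [[min(star[i][j], power[i][j]) for j in range(n)]
                for i in range(n)]
    return star


def mp_solve(a, v):
    """Return x = A* v, a solution of x = A x (+) v."""
    column = [[entry] for entry in v]
    return [row[0] for row in mp_matmul(mp_star(a), column)]


A = [[INF, 2], [1, INF]]
v = [5, 0]
x = mp_solve(A, v)
print(x)
```

Substituting the result back into $x = Ax \oplus v$ reproduces the same vector, which is the fixed-point property that Proposition 6 relies on.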

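Definition 6. $\forall \lambda \in M$:

1) add mapping $[r^{+}]$, $r \in [0, +\infty)$: $[r^{+}]\lambda(\tau) = \lambda(\tau) + r$;

2) multiplication mapping $[r^{\times}]$, $r \in \mathbb{R}^{+}$: $[r^{\times}]\lambda(\tau) = r \cdot \lambda(\tau)$.

The symbol $[\ ]$ may be omitted if there is no ambiguity. Note that $0^{+}$ and $1^{\times}$ are both the
identity element of $\langle M, \otimes \rangle$.

Let $F_1 = \{[r^{+}] \mid r \in [0, +\infty)\}$, $F_2 = \{[r^{\times}] \mid r \in \mathbb{R}^{+}\}$, and

$$F_a = \Big\{ \bigoplus_{i=1}^{m} \bigotimes_{j=1}^{n} f_{ij} \ \Big|\ f_{ij} \in F_1 \cup F_2;\ i, j, m, n \in \mathbb{N} \Big\}. \qquad (11)$$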



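Theorem 1. $F_a \subseteq F$.

Proof. Because all the elements of $F_1 \cup F_2$ are l.s.c. □

Consider the class of CEGs satisfying

$$\forall t \in T: \quad \max\Big\{ V(t_i)\, \frac{W(t_i, p_i)}{W(p_i, t)} \ \Big|\ p_i \in {}^{\circ}t,\ t_i = {}^{\circ}p_i \Big\} \le V(t). \qquad (12)$$

Theorem 2. The performance of the class of CEGs mentioned above can be described
by linear algebraic equations.

Proof. Consider a transition $t_i \in T$ with input places $p_{i_j} \in {}^{\circ}t_i$ and upstream
transitions $t_{i_j} = {}^{\circ}p_{i_j}$, as assumed above. Let $v_i(\tau) = V(t_i)\,\tau$,
$\rho_{ij} = M_{p_{i_j}}(0) / W(p_{i_j}, t_i)$ and $\mu_{ij} = W(t_{i_j}, p_{i_j}) / W(p_{i_j}, t_i)$.

The following equation holds:

$$M_{t_i}(\tau) = \min_j \Big\{ V(t_i)\,\tau,\ \frac{M_{p_{i_j}}(0) + M_{t_{i_j}}(\tau) \cdot W(t_{i_j}, p_{i_j})}{W(p_{i_j}, t_i)} \Big\}. \qquad (13)$$

In fact, according to (12), on the one hand, once an input place satisfies
$\mathrm{Mark}(p_{i_j}, \tau_0) = 0$, then $\mathrm{Mark}(p_{i_j}, \tau) = 0$ for all $\tau > \tau_0$; on the other hand, let
$\tau_0 = \min\{\tau \mid \mathrm{Mark}(p_{i_j}, \tau) = 0\}$, then $\ldots$. (13) can be represented as follows:

$$M_{t_i}(\tau) = \bigoplus_j a_{ij}\, M_{t_{i_j}}(\tau) \oplus v_i. \qquad (14)$$

Let $x = (M_{t_1}, M_{t_2}, \ldots, M_{t_k})^{T}$ and $v = (v_1, v_2, \ldots, v_k)^{T}$. (14) implies

$$x = Ax \oplus v, \qquad (15)$$

where $A = (a_{ij})$ is a mapping matrix, and $a_{ij} \in F_a$.

Corollary 2. The least solution to (15) is $x = A^{*}v$.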



5 Example of Optimal Control 



A manufacturing system represented by a CEG is shown in Fig. 3, where
$V(t_1) = 1$, $V(t_2) = 1$, $V(t_3) = 1$, $V(t_4) = 1$, $M_{p_1}(0) = 1$, $M_{p_2}(0) = 0$, $M_{p_3}(0) = 1$,
$M_{p_4}(0) = 1$, $M_{p_5}(0) = 1$ and $M_{p_6}(0) = m$.

Using Theorem 2, the following equation holds:

$$M(\tau) = A\,M(\tau) \oplus v. \qquad (16)$$

[The explicit entries of the state vector $M(\tau)$, the mapping matrix $A$ and the vector $v$ in (16) are not recoverable from the source.]

Fig. 3. Example of representing a CEG by dioid equations



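According to Corollary 2, the least solution to (16) is

$$M(\tau) = A^{*}v. \qquad (17)$$

[The explicit entries of (17), which involve terms such as $(m+1)^{+}$, are not recoverable from the source.]

The solution indicates that $m$ cannot affect the firing velocity of one of the
transitions, but only that of the other three (the transition subscripts are
illegible in the source).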



6 Conclusion 

In this note, a linear algebraic model $(F, \oplus, \otimes)$ for the performance evaluation of a
class of continuous event graphs has been developed. An interesting and useful
result of the example is a novel approach to computing the minimum (optimal)
initial tokens of places in a CEG from its algebraic model.

Abstract. Massively parallel computing systems are being built with
thousands of nodes. Because of the high number of components, it is
critical to keep these systems running even in the presence of failures.
Interconnection networks play a key role in these systems, and this paper
proposes a fault-tolerant routing methodology for use in such networks.
The methodology supports any minimal routing function (including fully
adaptive routing), does not degrade performance in the absence of faults,
does not disable any healthy node, and is easy to implement both in
meshes and tori. In order to avoid network failures, the methodology uses a
simple mechanism: for some source-destination pairs, packets are forwarded
to the destination node through a set of intermediate nodes (without
being ejected from the network). The methodology is shown to tolerate
a large number of faults (e.g., five/nine faults when using two/three
intermediate nodes in a 3D torus). Furthermore, the methodology offers
graceful performance degradation: in an 8 x 8 x 8 torus network with 14
faults the throughput is only decreased by 6.49%.



Keywords: fault-tolerance, direct networks, adaptive routing, virtual channels, 
bubble flow control. 

1 Introduction 

There exist many compute-intensive applications that require a huge amount of 
processing power, and this required computing power can only be provided by 
massively parallel computers. Examples of these systems are the Earth Simulator 
[12], the ASCI Red [1], and the BlueGene/L [2]. 

The huge number of processors and associated devices (memories, switches,
links, etc.) significantly increases the probability of failure. It is therefore critical
to keep such systems running even in the presence of failures. Much work deals
with failures of processors and memories. In this paper, we consider failures in
the interconnection network. These failures may isolate a large fraction of the
machine, wasting many healthy processors that otherwise could have been used.
Therefore, fault-tolerant mechanisms for interconnection networks are becoming
a critical design issue for large massively parallel computers.

* This work was supported by the Spanish MCYT under Grant TIC2003-08154-C06-01.

There exist several approaches to tolerate failures in the interconnection
network. The most prominent technique in commercial systems consists of
replicating components. The spare components are switched on in the case of failure
while switching off (or bypassing) the faulty components. The main drawback
of this approach is the high extra cost of the spare components. Another powerful
technique is based on reconfiguring the routing tables in the case of failure,
adapting them to the new topology after the failure [5]. This technique is
extremely flexible, but this flexibility may also kill performance. However, most
of the solutions proposed in the literature are based on designing fault-tolerant
routing algorithms able to find an alternative path when a packet meets a fault
along the path to its destination. Most of these fault-tolerant routing strategies
require a significant amount of extra hardware resources (e.g., virtual channels)
to route packets around faulty components depending on either the number of
tolerated faults [9] or the number of dimensions in the topology [17].
Alternatively, there exist some fault-tolerant routing strategies that use no or
very few extra resources to handle failures at the expense of providing
a lower fault-tolerance degree [9,14], disabling a certain number of healthy
nodes (either in blocks (fault regions) [6,7] or individually [10,11]), preventing
packets from being routed adaptively [15], or drastically increasing the latencies
for some packets [19]. Moreover, when faults occur, link utilization may become
significantly unbalanced when using those fault-tolerant routing strategies, thus
leading to premature network saturation, and consequently, degrading network
performance even more.

In [13] we proposed a fault-tolerant routing methodology for n-dimensional
meshes and tori that requires only one extra virtual channel. In order to
avoid network failures, an intermediate node is used for some source-destination
pairs.¹ This node is selected in such a way that the faults are avoided when the
packets are routed first to the intermediate node and then from this node to the
destination node. However, in order to tolerate an acceptable number of faults,
an additional mechanism is used, that is, disabling adaptive routing for some
paths (i.e., routing packets deterministically).

Disabling adaptivity has two main drawbacks. The first is that it has a
negative impact on network performance, because it prevents packets from being
adaptively routed. The second is that it needs additional complexity at the
routers in order to enable turning off adaptivity on a per-packet basis. For these
reasons it would be beneficial to use a single mechanism only. In this paper
we propose a methodology solely based on intermediate nodes, but instead of
using only one intermediate node, we propose to use several in order to
circumvent faulty components. This way, regardless of the number of intermediate
nodes being used, the way packets are routed does not need to be modified,
allowing the same router design as in the absence of intermediate nodes to be
used. Furthermore, this methodology allows all packets to be adaptively routed,
which again contributes to good network performance.

¹ Intermediate nodes were introduced by Valiant [21] for other purposes, such as traffic
balance.

On the other hand, this approach requires additional virtual channels
as more intermediate nodes are used. However, virtual channels are
nowadays inexpensive. Current interconnects are able to provide several virtual
channels. This is the case for the Cray T3E [20] with five virtual channels, the
BlueGene/L [2] with four virtual channels, and InfiniBand switches [16] with 16
virtual channels.

Still, when designing a fault-tolerant routing scheme that requires extra
virtual channels, it is desirable to use a bounded number of virtual channels. At the
same time one should also tolerate a reasonably large number of faults, avoid
disabling any healthy node, maintain a low router complexity, and guarantee
routing through adaptive paths in order to provide high network performance
both in the absence and in the presence of faults.

The rest of the paper is organized as follows. In Sect. 2, the methodology is 
presented. The methodology is then illustrated through some example scenarios 
in Sect. 3. In Sect. 4, the routing algorithm obtained by the methodology is 
analyzed in terms of performance and fault-tolerance. Finally, in Sect. 5, some 
conclusions are drawn. 



2 The Methodology 

The methodology for achieving fault-tolerance through the use of one or more
intermediate nodes will now be presented. We will assume a k-ary n-cube
(torus) or n-dimensional mesh network. The methodology is valid for any minimal
routing function, although it is applied to minimal adaptive routing [18] in this
paper. Minimal adaptive routing with v virtual channels allows the use of any
minimal path through v - 1 virtual (adaptive) channels, whereas the last channel
(i.e., the escape channel) uses deterministic routing. Thus, at least two virtual
channels per physical channel (v = 2) are required. In a torus the escape channel
also uses the bubble flow control mechanism [4].

Furthermore, a static fault model is assumed. This means that when a fault is
discovered all the processes are stopped, the network is emptied, and a
management application is run in order to deal with the fault. Checkpointing techniques
must also be used so that applications can be brought back to a consistent state
prior to the occurrence of the fault. Detection of faults, checkpointing, and
distribution of routing information are assumed to be performed as part of the
static fault model, and are therefore not further discussed in this paper.

A fault-free path is computed by the methodology for each source-destination
pair. In the presence of faults, those paths that may use some faulty components
are not valid. The methodology avoids these faults by using intermediate nodes.
Packets are first forwarded to the first intermediate node, then from this node
to the second one, and so forth until the packet reaches its final destination. As
shown in Fig. 1, the use of intermediate nodes reduces the number of possible
paths, and therefore enables avoiding areas containing faults. The original
routing algorithm (e.g., minimal adaptive routing) is used in all subpaths. Notice
that the packets are not ejected from the network at each intermediate node.




Fig. 1. The use of intermediate nodes (/) limits the number of possible paths, 
from the source (S') to the destination (Z3), enabling faults (F) to be avoided 

Packets sent through intermediate nodes carry the address of each interme- 
diate node, in addition to the address of the final destination. As the packet 
reaches each intermediate node, the address of that intermediate node is remo- 
ved from the packet header, until the packet finally reaches its true destination. 
In addition, every source node must maintain a table specifying the interme- 
diate node(s) to be used for each destination that requires such measures to be 
taken. When there are several candidates for the intermediate node(s), one of 
the alternatives can be selected randomly or more than one alternative could be 
listed in order to provide additional routing flexibility. 
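
The header handling described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in the paper: the `Packet` class, its field names, the `route_table` layout, and the node coordinates are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Packet:
    destination: tuple                                   # final destination D
    intermediates: deque = field(default_factory=deque)  # I_1, I_2, ... in visiting order

    def next_target(self):
        """Node the packet is currently being routed towards."""
        return self.intermediates[0] if self.intermediates else self.destination

    def arrive(self, node):
        """Update the header on arrival at `node`; return True once delivered."""
        if self.intermediates and node == self.intermediates[0]:
            self.intermediates.popleft()   # strip the address of the reached intermediate node
            return False                   # the packet is not ejected, it keeps travelling
        return node == self.destination and not self.intermediates


# Hypothetical per-source table: destination -> alternative lists of intermediate nodes.
route_table = {(3, 3): [[(0, 3)], [(3, 0)]]}

pkt = Packet(destination=(3, 3), intermediates=deque(route_table[(3, 3)][0]))
```

Listing more than one alternative per destination, as in `route_table`, corresponds to the extra routing flexibility mentioned in the text; how an alternative is picked is left open here.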

In what follows we will denote the source node as S and the destination
node as D. The intermediate nodes are denoted I_x, where I_1 refers to the first
intermediate node in a route. Faulty links are denoted as F_i. A node failure
can easily be modelled as the failure of all the links of a node.

Deadlock freedom is ensured by having a separate escape channel for each
phase. For example, with two intermediate nodes, one escape channel is used (if
required) from S to I_1, another from I_1 to I_2, and a third one from I_2 to D.
This way, each phase defines a virtual network, and the packets change virtual
network at each intermediate node. Although each virtual network relies on a
different escape channel, they all share the same adaptive channel(s). If y is
the allowed number of intermediate nodes for each source-destination pair, and
the minimal adaptive routing algorithm uses one adaptive and one escape channel
per physical channel, the methodology requires a total of y + 2 virtual
channels. Notice that one of them corresponds to the escape channel used in the
minimal adaptive routing algorithm. So, for two intermediate nodes, four virtual
channels are required.

The escape channels use deterministic Dimension Order Routing (DOR) with
the bubble flow control mechanism. With this mechanism, a packet that is
injected into the network or crosses a network dimension requires two free
buffers (i.e., one for the packet itself and one additional free buffer) to
guarantee deadlock freedom. Hence, in order to avoid deadlocks, a packet
changing virtual network at an intermediate node should be considered as
crossing a dimension, and therefore requires two free buffers.
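
The channel count and the bubble condition above can be summarized in a few lines. The following is a sketch under assumptions: the function names, the channel indexing (adaptive channels first, then one escape channel per phase), and the `free_buffers` argument are hypothetical and do not come from the paper.

```python
def channel_plan(max_intermediates, adaptive=1):
    """Virtual channels per physical channel: shared adaptive ones plus one escape per phase."""
    phases = max_intermediates + 1   # S->I_1, I_1->I_2, ..., I_y->D
    # Assumed numbering: escape channel of phase p sits right after the adaptive channels.
    escape_index = {phase: adaptive + phase for phase in range(phases)}
    return {"total": adaptive + phases, "escape_index": escape_index}


def may_enter_escape(free_buffers, injecting=False, crossing_dimension=False,
                     changing_virtual_network=False):
    """Bubble condition: these moves need one spare buffer besides the packet's own."""
    needs_bubble = injecting or crossing_dimension or changing_virtual_network
    return free_buffers >= (2 if needs_bubble else 1)
```

With one adaptive channel, `channel_plan(2)["total"]` gives the four virtual channels stated above for two intermediate nodes (y + 2 with y = 2).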

The computational complexity for identifying one intermediate node is O(1)
in torus and mesh topologies. For all the paths in the network the
computational complexity thus becomes O(n^2). When using two intermediate nodes
this increases to O(n^3) in the worst case. However, as we will see in Sect. 4,
the number of paths using more than one intermediate node is very low even when
there are many faults (e.g., 0.000001% of the paths in a 3 x 3 x 3 torus with
six faults). Thus, the methodology has a low computational cost, especially when
considering that a static fault model is used.

Next, a methodology for identifying the intermediate nodes is presented. 
First, the case where only one intermediate node is used is presented. We then 
show how the method can be extended to the use of multiple intermediate nodes. 



2.1 One Intermediate Node 

When at most one intermediate node is used for each source-destination pair,
the intermediate node I_1 should have the following properties so that the
fault(s) F_i are avoided when routing packets from S via I_1 to D:

1. I_1 is reachable from S.

2. D is reachable from I_1.

3. There is no I_1' giving a shorter path than I_1.

The first requirement guarantees that packets can be routed from S to I_1,
and the second requirement guarantees that packets can be routed from I_1 to
D. The third requirement guarantees that the final path is the shortest possible.

Also notice that, when minimal adaptive routing is used, a node N_2 is
reachable from a node N_1 if and only if: For all i, F_i is not on any minimal
path from N_1 to N_2.

To identify the possible intermediate nodes, let T_RS be the set of nodes
reachable from S and T_D the set of nodes from which D is reachable.
Furthermore, let l(x, y) be the length of the minimal path, in the fault-free
case, from x to y. We then define T_j (for j ≥ 0) in the following way: A node N
is in T_j if, and only if, l(S, N) + l(N, D) = l(S, D) + j.

This way, T_j defines non-overlapping sets of nodes, as shown in Figure 2.
These sets can easily be identified by starting with the nodes that are reached
(i.e., traversed) on any minimal path from S to D (i.e., j = 0), and continuing
outwards. As can be seen in Figure 2, the sets T_j are non-empty only for even
values of j. This is always the case for meshes, but not always for tori (due to
the wraparound links).
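
As an illustration of the definition of T_j, the sketch below groups the nodes of a small 2D mesh by j. It assumes nodes are (x, y) coordinate pairs and that l(x, y) is the Manhattan distance, which holds for a fault-free mesh; the function names and the 5 x 5 example are hypothetical.

```python
from collections import defaultdict


def mesh_distance(a, b):
    """Minimal hop count between two nodes of a mesh (no wraparound links)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def t_sets(width, height, src, dst):
    """Group every node N by j = l(S, N) + l(N, D) - l(S, D)."""
    base = mesh_distance(src, dst)
    sets = defaultdict(set)
    for x in range(width):
        for y in range(height):
            n = (x, y)
            j = mesh_distance(src, n) + mesh_distance(n, dst) - base
            sets[j].add(n)
    return dict(sets)


sets = t_sets(5, 5, src=(1, 1), dst=(3, 2))
```

In this example T_0 is the bounding box spanned by S and D, and every other node lands in a set with an even index, in line with the observation about meshes.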







N.A. Nordbotten et al. 




Fig. 2. The nodes in the sets T_j, for j < 4 in a 2D mesh



Theorem 1. Let j' be the smallest j for which T_j ∩ T_RS ∩ T_D is non-empty. A
node N fulfills all three requirements, if and only if N ∈ T_j' ∩ T_RS ∩ T_D.

Proof. We prove the theorem by induction. The theorem is true for j = 0 (i.e.,
when a minimal route exists):

- Let us assume that there is one node N in the set that does not fulfill the
  requirements. Then N would either have to be unreachable from S, not
  have a valid route to D, or not be on a minimal path from S to D. If N
  is unreachable from S it is by definition not in T_RS. If N does not have a
  valid route to D it is by definition not in T_D. If N is not on a minimal path
  from S to D it is by definition not in T_0. Because of the properties of set
  intersection, N must be in all three sets T_RS, T_D, and T_0 to be in the
  set T_0 ∩ T_RS ∩ T_D. Thus, we have a contradiction.

- Let us then assume that there is one node N, outside the set, which fulfills
  the requirements. N would then have to be outside at least one of the sets
  T_RS, T_D, or T_0. If N is outside T_RS it is unreachable from S and therefore
  does not fulfill requirement one. If N is outside T_D it has no valid route to
  D and therefore does not fulfill requirement two. If N is outside T_0 it
  violates our assumption that a minimal route exists (i.e., that j = 0). Thus,
  we have a contradiction in all three cases.

If the theorem is true for j = m, then the theorem is also true for j = m + 1:
Concerning requirements one and two, the arguments made for j = 0 also hold
for j = m + 1. Furthermore, when j = m + 1, no route S-I_1-D exists for
j < m + 1. Indeed, as each increase of j adds one additional hop to the path
S-I_1-D, all the intermediate nodes found when j = m + 1 yield paths S-I_1-D of
equal lengths. Finally, for the same reason, no shorter path can be found for
j > m + 1. The nodes given by the theorem therefore fulfill all three
requirements. □
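
The search implied by Theorem 1 can be sketched for a 2D mesh as follows. This is an illustrative sketch, not the authors' code: it assumes nodes are (x, y) pairs, faulty links are given as pairs of adjacent nodes and fail in both directions, and l(x, y) is the Manhattan distance. The function names are hypothetical.

```python
from itertools import product


def dist(a, b):
    """Minimal hop count between two mesh nodes in the fault-free case."""
    return sum(abs(x - y) for x, y in zip(a, b))


def reachable(a, b, faults):
    """True if no faulty link lies on any minimal path from a to b."""
    d = dist(a, b)
    for u, v in faults:
        # The link u-v is on some minimal path if crossing it in either
        # direction keeps the total length equal to the minimal distance.
        if dist(a, u) + 1 + dist(v, b) == d or dist(a, v) + 1 + dist(u, b) == d:
            return False
    return True


def one_intermediate(src, dst, faults, width, height):
    """Return (j', candidate nodes) in the sense of Theorem 1, or None."""
    base = dist(src, dst)
    by_j = {}
    for n in product(range(width), range(height)):
        if reachable(src, n, faults) and reachable(n, dst, faults):  # N in T_RS and T_D
            j = dist(src, n) + dist(n, dst) - base                    # N in T_j
            by_j.setdefault(j, set()).add(n)
    if not by_j:
        return None
    j_min = min(by_j)
    return j_min, by_j[j_min]


faults = [((0, 0), (1, 0))]   # one faulty link in a 3 x 3 mesh
diagonal = one_intermediate((0, 0), (2, 2), faults, 3, 3)
same_row = one_intermediate((0, 0), (2, 0), faults, 3, 3)
```

For the diagonal pair the search finds candidates already in T_0, whereas for the pair in the same row as the fault no single intermediate node exists, so `same_row` is `None`.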

This way, we start considering the minimal paths (j = 0) and then, if
necessary, non-minimal paths (j > 0) to avoid the fault(s).

2.2 Multiple Intermediate Nodes

In cases where one intermediate node is insufficient to avoid all the faults
when routing from S to D, two or more intermediate nodes can be used. The use of
multiple intermediate nodes may also enable shorter paths than those otherwise
obtained when using fewer intermediate nodes.

We will now first present a methodology for using two intermediate nodes. 
We then generalize this methodology so that it can be used, in a recursive way, 
for any number of intermediate nodes. 

Two Intermediate Nodes. When using two intermediate nodes, we are looking
for intermediate nodes I_1 and I_2 so that:

- I_1 is reachable from S.

- I_2 is reachable from I_1.

- D is reachable from I_2.

- There are no I_1' and I_2' giving a shorter path than S-I_1-I_2-D.

However, it can be observed that if a suitable I_1 is identified, then the
second intermediate node I_2 follows from Theorem 1. Thus, the problem can be
reduced to identifying I_1.

In order to solve this problem, let us introduce a variation of T_D, namely
T_{D_1}^{k}. We define this new set as the set of nodes that can reach D through
one intermediate node (i.e., the 1 in the subscript denotes that one
intermediate node is used). This intermediate node is given by Theorem 1, and k
here represents the j in the set T_j used, with Theorem 1, for identifying it.
For example, the set T_{D_1}^{0} consists of the nodes that have a minimal path,
via one intermediate node, to D. The set T_{D_1}^{1}, on the other hand,
consists of nodes which have a path length equal to the minimal path plus one,
via one intermediate node, to D. As before, T_RS denotes the nodes reachable
from S.

Theorem 2. Let j' and k' be the smallest j and k (i.e., so that their sum is
minimized) for which T_j ∩ T_RS ∩ T_{D_1}^{k} is non-empty. A node N fulfills
all four requirements if, and only if, N ∈ T_j' ∩ T_RS ∩ T_{D_1}^{k'}.

Proof. Let us define l as the sum of j and k, i.e., l = j + k. We then prove the
theorem by induction. The theorem is true for l = 0 (i.e., when a minimal path
exists):

- Let us assume that there is one node N in the set T_0 ∩ T_RS ∩ T_{D_1}^{0}
  that gives a path S-N-I_2-D that does not fulfill the requirements. It follows
  from Theorem 1, and the definition of T_{D_1}^{k}, that I_2 is reachable from
  N and that D is reachable from I_2. Thus, N must be unreachable from S or the
  path S-N-I_2-D is not the shortest possible. If N is unreachable from S, N is
  by definition not in T_RS. It also follows from Theorem 1 that the subpath
  N-I_2-D is the shortest possible. Thus, N cannot be on a minimal path from
  S to D for the path S-N-I_2-D to be a non-minimal path. However, then N
  is by definition not in T_0. Therefore, we have a contradiction.

- Let us then assume that there is one node N outside the set
  T_0 ∩ T_RS ∩ T_{D_1}^{0} that fulfills the requirements. N would then have to
  be outside at least one of the sets T_0, T_RS, or T_{D_1}^{0}. If N is outside
  T_0 it violates our assumption that l = 0. If N is outside T_RS, it is
  unreachable from S and therefore violates requirement one. If N is outside of
  the set T_{D_1}^{0} it violates requirements two or three, or our assumption
  that l = 0.

If the theorem is true for l = m, then it is also true for l = m + 1: As for
reachability, the same arguments as for l = 0 are still valid. Thus, it only
remains to be shown that the path S-N-I_2-D is the shortest possible. By
definition, when l = m + 1, no N exists for l < m + 1. Each increase of l adds
one hop to the path S-N-I_2-D. Thus, all paths where l = m + 1 are of equal
length, and no shorter path can be found for l > m + 1. □

Thus, as before, we start considering the minimal paths (i.e., j + k = 0) and 
then consider non-minimal paths (i.e., j + k > 0), if necessary, to avoid all the 
faults. 



Any Number of Intermediate Nodes. Let us now generalize the definition
of T_{D_1}^{k} in order to apply Theorem 2 for any number of intermediate nodes.
We therefore define T_{D_z}^{k} in the following way:

- T_{D_0}^{0}: The set of nodes from which D is reachable without the use of any
  intermediate node (i.e., the set of nodes defined by the original set T_D).

- T_{D_z}^{k} for z > 0 and k ≥ 0: The set of nodes given by
  T_j' ∩ T_RS ∩ T_{D_z'}^{k'}, where z = z' + 1 and k = j' + k'.

Thus, T_{D_z}^{k} is the set of nodes that reach D through z intermediate nodes,
and where k is the number of additional hops, in the path to D, compared to
the minimal path. When paths of equal length exist, preference should be given
to paths with fewer intermediate nodes.

Notice that the set T_j ∩ T_RS ∩ T_{D_0}^{0} is actually the same as that in
Theorem 1, and thus results in paths with one intermediate node. The set
T_j ∩ T_RS ∩ T_{D_1}^{k} is that given by Theorem 2, resulting in paths with two
intermediate nodes. Similarly, T_j ∩ T_RS ∩ T_{D_2}^{k} gives paths with three
intermediate nodes. Continuing this way, an arbitrary number of intermediate
nodes can be obtained.
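
The recursive construction can also be read as a small dynamic program over the number of allowed intermediate nodes. The sketch below is a restatement under assumptions rather than the set-based formulation of the paper: it works on a 2D mesh with (x, y) nodes, takes faulty links as pairs of adjacent nodes, and minimizes the total path length (which is what minimizing j + k amounts to), keeping the solution with fewer intermediate nodes on ties. All names are hypothetical.

```python
from itertools import product
from math import inf


def dist(a, b):
    """Minimal hop count between two mesh nodes in the fault-free case."""
    return sum(abs(x - y) for x, y in zip(a, b))


def reachable(a, b, faults):
    """True if no faulty link lies on any minimal path from a to b."""
    d = dist(a, b)
    return not any(
        dist(a, u) + 1 + dist(v, b) == d or dist(a, v) + 1 + dist(u, b) == d
        for u, v in faults
    )


def build_tables(dst, faults, width, height, max_intermediates):
    """cost[z][n]: shortest path length from n to dst with at most z intermediate nodes."""
    nodes = list(product(range(width), range(height)))
    cost = [{n: dist(n, dst) if reachable(n, dst, faults) else inf for n in nodes}]
    first = [{n: None for n in nodes}]
    for _ in range(max_intermediates):
        prev = cost[-1]
        cur, hop = dict(prev), {n: None for n in nodes}  # start from fewer intermediates
        for n, m in product(nodes, nodes):
            if m != n and prev[m] < inf and reachable(n, m, faults):
                c = dist(n, m) + prev[m]
                if c < cur[n]:        # strict: ties keep the fewer-intermediates option
                    cur[n], hop[n] = c, m
        cost.append(cur)
        first.append(hop)
    return cost, first


def intermediates(src, cost, first):
    """List of intermediate nodes for src, or None if no valid path was found."""
    if cost[-1][src] == inf:
        return None
    path, n = [], src
    for z in range(len(cost) - 1, 0, -1):
        if first[z][n] is not None:   # an extra intermediate node paid off at this level
            n = first[z][n]
            path.append(n)
    return path


faults = [((0, 0), (1, 0))]
cost, first = build_tables((2, 0), faults, 3, 3, max_intermediates=2)
route = intermediates((0, 0), cost, first)
```

In this 3 x 3 mesh example, source, fault, and destination share a row: no path exists with zero or one intermediate node, while two intermediate nodes give a path of length four.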

3 Example Scenarios 

We will now illustrate the methodology through two example scenarios. A 2D 
mesh is used for this purpose, although the methodology is also valid for other 
topologies such as a 3D mesh or torus. For both scenarios we assume that mi- 
nimal adaptive routing is used, and that at most two intermediate nodes are 
allowed in each route. 

Figure 3a shows a scenario with three faults. Because there are faults present
in some of the minimal paths between S and D, an intermediate node is needed.
In order to find a minimal path, we look for an intermediate node within T_0. As
shown in Fig. 3a, there are several nodes within T_0 that are either reachable
from S, or able to reach D. However, we are only interested in nodes with both
of these attributes, i.e., the nodes given by the set T_0 ∩ T_RS ∩ T_D. In this
scenario there is only one such node, i.e., the one identified as a possible
intermediate node in the figure. By using this node as the intermediate node, it
is guaranteed that the faults are not encountered when packets are routed first
from S to I_1 and then from I_1 to D.




Fig. 3. (a) The faults are avoided by the use of one intermediate node. The
shaded area identifies the nodes in T_0. (b) Two intermediate nodes must be used
in order to avoid the faults. The figure shows how the first of these
intermediate nodes (i.e., I_1) is identified. The shaded areas identify the
nodes in T_2



Figure 3b shows the same fault scenario as in the previous example, except
that the source node is different. In this case, all the minimal paths between S
and D are blocked by faults. The set T_0 ∩ T_RS ∩ T_{D_0}^{0}, giving minimal
paths with one intermediate node, is therefore empty. The set
T_0 ∩ T_RS ∩ T_{D_1}^{0}, giving minimal paths with two intermediate nodes, is
also empty. Because preference is given to the paths with the least number of
intermediate nodes when the path length is equal, we then try to find an
intermediate node within T_2 (because this is a mesh we are only interested in
the even values of j) giving a non-minimal path with one intermediate node.
However, this set, T_2 ∩ T_RS ∩ T_{D_0}^{0}, is also empty.

There are now two more sets giving the same path lengths as the previous
one, but using two intermediate nodes instead of one. Which of these two sets is
given preference is irrelevant for the correctness of the methodology as they
both give the same value for j + k (which should be minimized according to
Theorem 2). Increasing j means adding one hop to the path S-I_1, while
increasing k adds one hop to the path I_1-I_2-D.

Of the two sets, the set T_0 ∩ T_RS ∩ T_{D_1}^{2} is empty, while
T_2 ∩ T_RS ∩ T_{D_1}^{0} gives us the possible intermediate nodes shown in
Fig. 3b. Thus, the first intermediate node, I_1, can be selected among these
three nodes. If I_1' is the first intermediate node, then the second
intermediate node, I_2, can be selected among the intermediate nodes that give
I_1' a path with one intermediate node to D. In this case, I_2 would be the same
as the one identified as I_1 in the first example.



4 Evaluation of the Methodology 

In this section, we evaluate the proposed methodology. In a first study, we
analyze the fault-tolerance properties of the methodology, i.e., how many faults
the mechanism is able to tolerate. The methodology is n-fault tolerant if it is
able to tolerate any combination of n failures. A given combination of failures,
in turn, is tolerated if the methodology is able to provide a valid path for
every source-destination pair in the network. On the other hand, faults can
physically disconnect some nodes in the network. In this situation, disconnected
nodes are not taken into account and, provided that the paths for the remaining
nodes can be computed, the fault combination is considered as tolerated.

Then, we evaluate how the methodology influences network performance. 
For this, network throughput has been measured for different numbers of faults. 
For each number of faults, 50 randomly generated fault combinations have been 
simulated, and the average network throughput for these combinations is provi- 
ded. 

We have applied the methodology to 3 x 3 x 3 (27 nodes) torus and mesh
networks, to a 3 x 3 (9 nodes) torus network, and to an 8 x 8 x 8 (512 nodes)
torus network. Although actual systems are built with larger topologies (e.g.,
a 32 x 32 x 64 torus for the BlueGene/L), smaller networks can be evaluated
exhaustively from a fault-tolerance point of view and the results can then
easily be extended to larger networks.



4.1 Simulation Model 

A detailed event-driven simulator has been used to analyze the performance 
exhibited by the proposed methodology. The simulator models a direct inter- 
connect network with point-to-point bidirectional serial links. Each router has 
a non-multiplexed crossbar with queues only at the input ports. Each physical 
input port uses five virtual channels, each providing buffering resources in order 
to store up to two packets. A round-robin policy has been chosen to select among 
packets contending for the same output port. 

In order to make a fair evaluation, the same number of virtual channels (i.e.,
five) is used regardless of the number of intermediate nodes used in the
methodology. Virtual channels are used as adaptive or as escape channels,
depending on the number of required intermediate nodes. If paths use at most one
intermediate node, three virtual channels are used for adaptive routing, whereas
the remaining two virtual channels are used for the escape paths. For two
intermediate nodes, two are adaptive channels and three are escape channels,
and so on. When faults are not present (i.e., when no intermediate nodes are
required), four adaptive channels are used. In the escape channel(s), packets
are deterministically routed following DOR routing and the bubble flow control
mechanism. Notice that for a given number of intermediate nodes, all the paths
in the network will have the same number of adaptive virtual channels,
regardless of whether they use intermediate nodes or not.

For each simulation run, we assume that the packet generation rate is con- 
stant and the same for all of the nodes. The destination of a message is randomly 
chosen with the same probability for all the nodes. This pattern has been widely 
used in other evaluation studies [3,8]. In all the simulations, the packet length 
is set to 128 bytes. 



4.2 Fault Analysis Models 

For a small number of faults in the network, all the possible combinations of
faults can be explored. However, as the number of faults increases, the number
of possible fault combinations increases exponentially. Therefore, beyond a
certain number of faults, it is impossible to explore all the fault combinations
in a reasonable amount of time. We tackle this problem with two approaches. In
the first approach, we focus on faults confined to a limited region of the
network. Notice that the worst combinations of faults to be solved by the
methodology are those where the faults are closely located. This is because the
number of fault-free paths in that region is reduced. Because the number of
fault combinations within such a region is much lower than for the entire
network, all the fault combinations can be evaluated. Although the results
obtained cannot be directly extended to the generic case, where faults may be
located over the entire network, they give us an approximation of the
effectiveness of the methodology in the worst case.

For this, we must define the region where the faults are to be located. The
region is formed by all the links attached to the nodes that are one hop away
from a node (the center node). Therefore we refer to this as a distance 1
region, and it consists of 36 links. However, in a 3 x 3 x 3 torus it only
consists of 33 links, as three of the links are then shared by nodes within the
region. The center node is randomly selected.¹ Notice that with a high number of
faults and for a large number of fault combinations, the center node is hardly
accessible, as very few links are non-faulty. So, the distance 1 region actually
represents a real worst case for accessing the center node.
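
The growth in the number of fault combinations, and the saving obtained by restricting faults to the distance 1 region, can be checked with binomial coefficients. The sketch below assumes that a 3 x 3 x 3 torus has 81 bidirectional links (27 nodes with six links each, every link counted once) and uses the 33-link region stated above; the variable names are hypothetical.

```python
from math import comb

TORUS_LINKS = 27 * 6 // 2   # 3 x 3 x 3 torus: 81 links, each counted once
REGION_LINKS = 33           # distance 1 region in a 3 x 3 x 3 torus

# Number of ways to place k link faults in the whole network or in the region.
exhaustive = {k: comb(TORUS_LINKS, k) for k in range(1, 7)}
region = {k: comb(REGION_LINKS, k) for k in range(6, 13)}

for k in (6,):
    print(k, exhaustive[k], region[k])
```

These counts coincide with the "Combinations analyzed" column of Table 1 for the exhaustive and distance 1 analyses.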

In the second approach, a statistical analysis is performed, analyzing a subset 
of the fault combinations, where the faults are randomly located over the entire 
network. From the obtained results, statistical conclusions are extracted about 
the fault-tolerance degree of the proposed methodology. 

¹ The selection of the center node does not affect the results in a torus
network due to the symmetry of the topology.

Table 1. Fault tolerance achieved by the methodology when using at most one
(I x 1), two (I x 2), or three (I x 3) intermediate nodes in a 3 x 3 x 3 torus
network. The three rightmost columns show the percentages of the paths that
use each number of intermediate nodes when at most three intermediate nodes
are used

Link    Analysis     Combinations  Not tolerated combinations  Paths using # I (I x 3)
faults  type         analyzed      I x 1    I x 2    I x 3     1 I     2 I    3 I
1       Exhaustive   81            0%       0%       0%        6.86%   0%     0%
2                    3,240         2.50%    0%       0%        12.99%  0.04%  0%
3                    85,320        7.44%    0%       0%        18.46%  0.13%  0%
4                    1,663,740     14.67%   0%       0%        23.32%  0.31%  0%
5                    25,621,596    24.06%   0%       0%        27.62%  0.56%  0%
6                    324,540,216   35.49%   0.0002%  0%        31.41%  0.90%  0.000001%
6       Dist. 1      1,107,568     54.52%   0.01%    0%        28.09%  1.19%  0.00003%
7                    4,272,048     70.31%   0.06%    0%        30.41%  1.78%  0.0004%
8                    13,884,156    83.30%   0.31%    0%        32.25%  2.51%  0.002%
9                    38,567,100    92.15%   1.06%    0%        33.67%  3.38%  0.008%
10                   92,561,040    96.97%   2.99%    0.001%    34.71%  4.36%  0.02%
11                   193,536,720   99.01%   6.51%    0.01%     35.42%  5.44%  0.05%
12                   354,817,320   99.67%   12.88%   0.62%     35.84%  6.58%  0.11%
6       Statistical  10,000,000    35.46%   0.00%    0%        31.41%  0.90%  0.000001%
7                    10,000,000    48.72%   0.00%    0%        34.72%  1.34%  0.00001%
8                    10,000,000    62.98%   0.01%    0%        37.61%  1.88%  0.00007%
9                    10,000,000    76.51%   0.03%    0%        40.10%  2.53%  0.0002%
10                   10,000,000    87.40%   0.09%    0%        42.21%  3.29%  0.0008%
11                   10,000,000    94.47%   0.23%    0%        43.98%  4.16%  0.002%
12                   10,000,000    98.05%   0.52%    0.00001%  45.44%  5.14%  0.005%
13                   10,000,000    99.46%   1.10%    0.0003%   46.60%  6.22%  0.01%
14                   10,000,000    99.88%   2.13%    0.0009%   47.50%  7.41%  0.02%



4.3 Evaluation Results 

Table 1 shows the fault tolerance achieved by the methodology for a 3 x 3 x 3 
torus network. The table shows the results for the three different types of analysis 
performed (exhaustive, distance 1, and statistical). From the exhaustive analysis 
results, we can observe that the methodology is only 1-fault tolerant when using 
only one intermediate node. For two faults present in the network, 2.5% of the 
fault combinations are not tolerated when using one intermediate node. As the 
number of faults increases, the percentage of not tolerated combinations grows 
fast. For six faults, 35.49% of the fault combinations are not supported when 
using one intermediate node. 

By using two intermediate nodes, the methodology greatly increases its fault
tolerance degree. In particular, it is 5-fault tolerant as all the fault
combinations of up to and including five faults are tolerated. With six faults
in the network, two intermediate nodes were sufficient for almost all the fault
combinations, except for 0.0002% of the combinations.

With three intermediate nodes, the methodology achieves a very good fault
tolerance degree. From the exhaustive analysis results, we observe that using
three intermediate nodes allows tolerating all the possible fault combinations
up to and including six faults. In the statistical analysis, where 10 million
randomly generated fault combinations were analyzed for up to 14 faults², the
methodology could provide a valid path for every non-disconnected pair of nodes
for up to 11 faults. For 12 faults one not tolerated combination was found (in
10,000,000) and this number increased to 89 when 14 faults were present.

However, taking into account the distance 1 analysis (representing the worst
case situation) we can observe that with 10 faults in the network there were
some not tolerated combinations. Therefore, the methodology is not 10-fault
tolerant. Moreover, from seven up to and including nine faults we cannot deduce
for sure that the methodology is n-fault tolerant since not all the fault
combinations have been tested. However, the results strongly indicate that the
methodology tolerates nine faults. In any case, even with a high number of
faults, the percentage of not supported fault combinations is very low when
using three intermediate nodes. So, the methodology achieves a high
fault-tolerance degree.

Table 1 also shows the percentage of paths that use a certain number of 
intermediate nodes. As can be seen, most of the paths avoid faults by using 
just one intermediate node, and very few paths need a third intermediate node. 
Notice that although the third intermediate node is little used, it makes a large 
difference for the fault-tolerance degree. 

Table 2 shows the results achieved for a 3 x 3 torus network and for a
3 x 3 x 3 mesh network. In the 3 x 3 torus, all the combinations of up to and
including six faults (i.e., 1/3 of the total number of links) have been
exhaustively analyzed. The methodology tolerates one fault when using one
intermediate node, three faults when using two intermediate nodes, and five
faults when using three intermediate nodes. So, the fault-tolerance degree in a
2D torus is lower than in a 3D torus. This is not unexpected considering that a
2D torus provides lower routing flexibility.

For the 3 x 3 x 3 mesh network, the results are not as good as for the torus
networks. The methodology requires at least two intermediate nodes in order to
be 1-fault tolerant.³ When using three intermediate nodes, the methodology is
4-fault tolerant, and it is 6-fault tolerant when using four intermediate nodes.

Finally, Fig. 4 shows the performance degradation exhibited by the methodology
in an 8 x 8 x 8 torus network when up to two intermediate nodes are used. Notice
that in a larger network, like the one used in the performance analysis, the
percentage of not tolerated fault combinations, when using two intermediate
nodes, is much lower than in a 3 x 3 x 3 torus. Thus, all the randomly generated
combinations for the performance evaluation could be solved by the use of at
most two intermediate nodes. When only one fault is present in the network,
only one intermediate node is used. The figure shows, for every number of
faults, the average throughput achieved. The presented throughput is the average
of the individual results obtained when evaluating the 50 randomly generated
fault combinations.⁴ As can be observed, the throughput decreases as the number
of faults in the network increases. However, the decrease in throughput is very
low. In particular, when there are 14 faults, the throughput is on average only
decreased by 6.49% compared to the fault-free case (from 474 flits/cycle to 443
flits/cycle). This degradation is lower than the one obtained with the
methodology proposed in [13], where with 5 virtual channels the degradation from
the fault-free case to 14 faults was 11.02%.

Table 2. Fault tolerance degree achieved by the methodology for a 3 x 3 torus
network and a 3 x 3 x 3 mesh network. The table shows the percentage of the total
number of combinations that are not tolerated. The results have been obtained
by exhaustively analyzing all the possible fault combinations

[Table body not recoverable from the source; surviving cells: 0.002%, N/A, N/A,
N/A, 100%, 64.53%, 2.83%, 0.02%]

² The error due to not analyzing all the combinations is lower than 0.05.

³ Notice that in a mesh the wraparound links do not exist and it is therefore
impossible to communicate to a node on the direct opposite side of the fault
without using at least two intermediate nodes (i.e., when S, F, and D are in the
same row/column).



5 Conclusions 

In this paper we have proposed a fault-tolerant routing methodology based on 
the use of intermediate nodes. The proposed methodology can be applied with 
any minimal routing function in n-dimensional mesh and torus networks, and 
it has been applied with minimal adaptive routing in this paper. The main 
advantage of the proposed mechanism is its simplicity, since the same original 
routing (e.g., minimal adaptive) continues to be valid. The only requirement 
on switches is that they should provide the required number of extra virtual 
channels. However, only a low number of virtual channels is required. 

The paper provides the necessary and sufficient conditions, for selecting the 
intermediate nodes, in order to tolerate as many faults as possible and to provide 
the shortest paths possible. 

The methodology has been shown to be 5-fault tolerant when using two
intermediate nodes in a 3D torus network. When using three intermediate nodes,
the method is 9-fault tolerant for 3D torus networks, 5-fault tolerant in 2D
torus, and 4-fault tolerant in 3D mesh topologies.

⁴ The 95% confidence intervals are always smaller than 0.796.

Fig. 4. Overall throughput (flits/cycle) for the proposed methodology in an
8 x 8 x 8 torus network. Five virtual channels are used

Regarding performance, the methodology does not degrade performance in
the absence of faults, whereas in the presence of faults it provides a graceful
performance degradation. Specifically, it has been shown that the average
performance degradation, in an 8 x 8 x 8 torus network with 14 faults, is only
6.49%.



Abstract. In this paper, an extended DBP (E_DBP) scheme is studied
for the (m,k)-firm constraint. The basic idea of the proposed algorithm is to
take into account the distance to exit a failure state, which is a notion
symmetrical to the distance to fall into a failure state in DBP. Quality of
Service (QoS) in terms of dynamic failure and delay is evaluated. Simulation
results reveal the effectiveness of E_DBP in providing better QoS.



1 Introduction 

Real-time media servers for delivering audio/video streams need to service
hundreds and, possibly, thousands of applications, each with its own quality of
service (QoS) requirements. Many such applications can tolerate the loss of a
certain fraction of the information required from the server, resulting in
little or no noticeable degradation in QoS [1] [2]. Consequently, loss-rate is
an important performance measure for the QoS of many real-time media
applications. We define the term loss-rate as the fraction of packets in a
stream either discarded or serviced later than their delay constraints
(deadlines) allow [3].

One of the problems with using loss-rate as a performance metric is that 
it does not describe when losses are allowed to occur. For most loss-tolerant 
applications, there is usually a restriction on the number of consecutive packet 
losses that are acceptable. For example, losing a series of consecutive packets 
from an audio stream might result in the loss of a complete section of audio, 
rather than merely a reduction in the signal-to-noise ratio. 

A suitable performance metric in this case is a window-based loss-rate, i.e.,
loss-rate constrained over a finite range, or window, of consecutive packets.
More precisely, an application might tolerate at most k-m packet losses for
every k arrivals at the various service points across a network. Any service
discipline attempting to meet these requirements must ensure that the number of
violations of the loss-tolerance specification is minimized (if not zero) across
the whole stream. Equivalently, at least m packets must be serviced before their
deadlines in any k consecutive packets. We refer to such a QoS requirement as an
(m,k)-firm constraint. If fewer than m packets are serviced successfully in any
window of k packets, the application is said to experience a dynamic failure and
the current state is called a failure state. An approach called Distance-based
Priority (DBP), based on the (m,k)-firm idea, has been proposed to schedule
multiple packet streams competing for service on a single server, each having
its own (m,k)-firm constraint [4]. It has been shown that when streams have the
same or different (m,k)-firm constraints and are identical (i.e., with the same
packet transmission time distribution, the same packet inter-arrival
distribution, and the same deadline distribution), DBP is more effective at
reducing the probability of dynamic failure than a conventional scheduling
scheme where all packets are serviced at the same priority level [4]. This idea
was then generalized under the name weakly hard real-time to deal with real-time
applications that allow some packet losses without violating the desired
behavior of the application [5]. In this paper, we propose an extended DBP
(E_DBP) scheme to study (m,k)-firm based QoS, and compare E_DBP and DBP for
streams with (m,k)-firm constraints in terms of probability of dynamic failure
and delay.

The rest of the paper is structured as follows. Section 2 describes the DBP scheme
and some related work. In Section 3, E_DBP scheduling is proposed. In Section 4,
the QoS performance in terms of dynamic failure and delay is evaluated through
simulation in overloaded scenarios. Finally, we make some concluding remarks.

2 DBP Scheduling on Streams with (m,k)-Firm

We begin this section by defining the problem we focus on: scheduling real-time
streams with (m,k)-firm constraints. Then we describe how the DBP scheme works.
Finally, its drawbacks are stated.

2.1 Problem Definition 

In order to define the real-time scheduling problem based on (m,k)-firm con- 
straint addressed as part of this paper, we introduce the following definitions: 



Application Model. As briefly mentioned in the introduction, DBP
scheduling is designed to efficiently serve multiple streams under
(m,k)-firm constraints sharing a single server. This system is called multiple
input queues on a single server (MIQSS). This model can be used to study
a large category of computer and telecommunication systems, such as multiple
tasks executing on a CPU, or the transmission of packets issued from multiple
packet sources sharing the same transmission medium or network interconnection
equipment (switch or router). The proposed model is for n loss-tolerant applications
generating n packet streams τ_i (i = 1, 2, ..., n) that will be served by a single
server. Each stream is formed by a source and a waiting queue, where a packet
issued from the source waits until it is chosen by the server. The server chooses
among the packets at the head of the queues according to its scheduling scheme.

In such a model, the scheduling scheme is of prime importance to provide not
only the (m,k)-firm guarantee for each individual stream (end user's point of view)
but also good server utilization (system designer's point of view).
Stream Characterization. A stream is characterized by a 3-tuple (C_i, D_i, T_i),
where C_i is the service time for a packet in stream τ_i. It is assumed that
all packets in τ_i have the same service time. For the purposes of this paper,
time is divided into fixed-size slots, and every packet can be serviced in
one such slot. The deadline D_i is the latest time by which a packet must finish
its service. If a packet cannot be finished by D_i, it is discarded, which is called
a deadline miss or packet loss. T_i is the inter-arrival time between consecutive
packets.



Loss-Tolerance. This is specified by the (m,k)-firm constraint, where m is the
minimum number of packets that must be transmitted successfully by their deadlines
in any window of k consecutive packet arrivals in the same stream. Otherwise,
the stream experiences a dynamic failure. The rate at which a stream experiences
dynamic failure is therefore a measure of how often the QoS falls below the
acceptable level; it is defined as the probability of dynamic failure.
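
As a minimal sketch of this definition, the following Python function counts how many windows of k consecutive packets contain fewer than m met deadlines. The function name and the choice to evaluate only complete windows are assumptions made for illustration; they are not taken from the paper.

```python
from collections import deque


def count_dynamic_failures(outcomes, m, k):
    """Count windows of k consecutive packets with fewer than m deadlines met.

    outcomes: iterable of 1 (deadline met) or 0 (deadline missed or dropped).
    Returns (number of failing windows, number of windows evaluated).
    Assumption: only complete windows of k packets are evaluated.
    """
    window = deque(maxlen=k)  # sliding window over the last k outcomes
    failures = 0
    windows = 0
    for bit in outcomes:
        window.append(bit)
        if len(window) == k:
            windows += 1
            if sum(window) < m:  # fewer than m packets met their deadline
                failures += 1
    return failures, windows
```

The ratio of the two returned values gives an empirical estimate of the probability of dynamic failure for one stream.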



Problem Statement. The problem addressed in this paper is to propose a more
effective scheduling scheme than DBP that guarantees better (m,k)-firm based QoS
for each stream, in terms of dynamic failure and delay, with the given resources.



2.2 DBP Outline 

DBP was first put forward in [4] as a dynamic priority assignment mechanism
for streams with (m,k)-firm constraints in an MIQSS model, targeted primarily
at loss-tolerant, real-time applications such as multimedia.

The basic idea of the DBP algorithm is quite simple and straightforward: the
closer a stream is to a failure state, the higher its priority. A failure state
occurs when the stream's (m,k)-firm requirement is transgressed, i.e., there are
more than k - m deadline misses within the last k-length window.

So for each stream τ with an (m,k)-firm constraint, the priority is assigned
based on the number of consecutive deadline misses that would lead the stream to
violate its (m,k)-firm constraint. This number of deadline misses is referred to as
the distance to fall into a failure state from the current state. The distance can
be evaluated by examining the recent history of τ, which is recorded in the
k-sequence. If two streams have the same distance, Earliest Deadline First (EDF)
is adopted as the adjunctive scheme.

The k-sequence is a word of k bits in which each bit records whether the deadline
of one of the last k packets was missed (bit = 0) or met (bit = 1). In this paper,
the leftmost bit represents the oldest packet and the rightmost bit the most recent
one. Each newly arriving packet causes a shift of all the bits to the left: the
leftmost bit exits the word and is no longer considered, while the rightmost bit is
set to 1 if the packet has met its deadline (i.e. it has been served before its
deadline) or to 0 otherwise.
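
The shift described above can be sketched in a few lines of Python. Representing the k-sequence as a string of '0' and '1' characters, with the oldest packet on the left, is an illustrative choice and not something the paper prescribes.

```python
def update_k_sequence(k_seq: str, met_deadline: bool) -> str:
    """Shift the k-sequence left by one bit and append the newest outcome.

    k_seq is a string of '0'/'1' with the oldest packet on the left.
    The leftmost (oldest) bit is dropped, so the length stays equal to k.
    """
    newest = "1" if met_deadline else "0"
    return k_seq[1:] + newest
```

For example, a met deadline turns 11100 into 11001, and a missed deadline turns 11011 into 10110.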

DBP assigns a priority to a packet at a given instant according to
the distance of the current k-sequence to a failure state. The distance can be
evaluated easily by appending consecutive 0s to the right side of the k-sequence
until a failure state is reached. If a stream is already in a failure state (i.e.,
there are fewer than m 1s in the k-sequence), the highest priority is assigned.

Formally, according to [4], the priority is evaluated as follows. Let
$s_j = (\delta^j_{i-k_j+1}, \ldots, \delta^j_{i-1}, \delta^j_i)$ denote the state of
the previous $k_j$ consecutive packets of $\tau_j$, and let $l_j(n, s)$ denote the
position (from the right) of the $n$-th meet (or 1) in $s_j$. Then the
priority of the $(i+1)$-th packet of $\tau_j$ is given by:

$$\mathrm{DBP}^j_{i+1} = k_j - l_j(m_j, s_j) + 1 \qquad (1)$$

We note that if there are fewer than $n$ 1s in $s$, then $l_j(n, s) = k_j + 1$ and the
highest priority (DBP = 0) is assigned, as expected for a stream in a failure state.

Example 1: for a stream $\tau_1$ with a (3,5)-firm constraint and current k-sequence
11011, we get $l_1(3, s_1) = 4$ and $\mathrm{DBP} = 5 - 4 + 1 = 2$. If the current
k-sequence is 10000, then $l_1(3, s_1) = 5 + 1 = 6$, so $\mathrm{DBP} = 0$.
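
A minimal Python sketch of equation (1) follows, using a string of '0'/'1' characters with the oldest packet on the left. The helper name `position_from_right` and the string representation are assumptions made for illustration only.

```python
def position_from_right(k_seq: str, n: int, symbol: str) -> int:
    """1-based position, counted from the right, of the n-th `symbol`.

    Returns k + 1 when the k-sequence holds fewer than n such symbols,
    matching the convention stated for l_j(n, s).
    """
    seen = 0
    for pos, bit in enumerate(reversed(k_seq), start=1):
        if bit == symbol:
            seen += 1
            if seen == n:
                return pos
    return len(k_seq) + 1


def dbp_priority(k_seq: str, m: int) -> int:
    """Distance to a failure state: k - l(m, s) + 1 (0 means failure state)."""
    k = len(k_seq)
    return k - position_from_right(k_seq, m, "1") + 1
```

With the values of Example 1, `dbp_priority("11011", 3)` evaluates to 2 and `dbp_priority("10000", 3)` evaluates to 0.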

2.3 Drawbacks of DBP 

Although DBP is effective in guaranteeing (m,k)-firm constraints, it still has
some drawbacks. The first is that it only uses the distance to fall into a failure
state, whereas the richer information of the distribution of 0s and 1s in the
k-sequence is neglected. To illustrate this problem, consider two different
k-sequences under a (2,5)-firm constraint: 11100 from τ_1 and 11001 from τ_2.
They have the same distance (DBP = 2). If their packets arrive at the same time,
then according to DBP, EDF is used by default as the adjunctive scheduling scheme.
But is EDF optimal in such a situation?

It appears that 11100 is less robust than 11001. For example, after a successful
service of both, these two k-sequences become 11001 (DBP = 2) and 10011 (DBP = 4),
respectively. However, it is not necessarily the case that the next packet from τ_1
should be served first, even if it has the earlier deadline. In fact, how to use such
information depends on what we would like to optimize. Setting the priority
so as to optimize a given objective function may be more complicated.
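
The comparison of the two k-sequences can be reproduced with a short Python sketch. The helper names `dbp` and `after` are hypothetical, and the string representation (oldest packet on the left) is an illustrative assumption.

```python
def dbp(k_seq, m):
    """DBP distance k - l(m, s) + 1; 0 when fewer than m ones are present."""
    k = len(k_seq)
    ones = 0
    for pos, bit in enumerate(reversed(k_seq), start=1):
        if bit == "1":
            ones += 1
            if ones == m:
                return k - pos + 1
    return 0


def after(k_seq, met):
    """k-sequence after one more packet (shift left, append outcome)."""
    return k_seq[1:] + ("1" if met else "0")


def outlook(k_seq, m):
    """(current DBP, DBP after a met deadline, DBP after a miss)."""
    return dbp(k_seq, m), dbp(after(k_seq, True), m), dbp(after(k_seq, False), m)
```

Under the (2,5)-firm constraint, `outlook("11100", 2)` and `outlook("11001", 2)` share the same current distance and the same distance after a miss, but differ after a met deadline. This is the information that the DBP value alone does not capture.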

Another shortcoming of DBP is that it assigns priorities considering only
one stream at a time, without taking into account the parameters of the other
streams sharing the same server. This results in a local rather than a global
priority and may lead to "priority inversion" phenomena. Improved algorithms
that overcome this problem have been proposed in [6] [7].

3 Extended Distance-Based Priority Scheduling 

One possible way to exploit the distribution of 0s and 1s in the k-sequence
arises when a stream falls into a failure state. Any packet of a stream in a
failure state has the same DBP value (DBP = 0), but the distance to exit the failure
state may be different. Taking again the (2,5)-firm constraint as an example, DBP
assigns the same priority to the following different k-sequences in failure states:
00001 and 10000. To exit the failure state, however, 00001 needs only one more 1
whereas 10000 needs two consecutive 1s. In a heterogeneous system in particular,
where the streams have different (m,k)-firm constraints, this situation occurs
frequently, so it is necessary to consider the distribution of 0s and 1s.

Based on the information in the k-sequence, including the distribution of 0s and
1s, we propose to extend the notion of distance to a failure state by introducing
the notion of distance to exit a failure state.

For each stream with an (m,k)-firm constraint in a failure state, the priority is
assigned based on the number of consecutive deadlines that must be met for the
stream to satisfy its (m,k)-firm constraint again. This number of consecutive
deadlines met is referred to as the distance to exit a failure state from the
current state.

The distance to exit a failure state is thus the number of consecutive 1s
that must be added to the right side of the k-sequence. Formally, consider a
stream $\tau_j$ with constraint parameters $m_j$ and $k_j$ in a failure state, and
let $s_j = (\delta^j_{i-k_j+1}, \ldots, \delta^j_{i-1}, \delta^j_i)$ be its current
k-sequence. Define $l_j(n, s)$ as the position (from the right side) of the $n$-th
miss (or 0) in the state $s_j$. The distance to exit a failure state of stream
$\tau_j$ is then given by equation (2):

$$\Phi_j = k_j - l_j(k_j - m_j + 1, s_j) + 1 \qquad (2)$$

Example 2: $\Phi = 2$ for 100011 with a (4,6)-firm constraint and $\Phi = 1$ for
00011 with a (3,5)-firm constraint.
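
The distance to exit a failure state can be sketched in Python as follows. The first function follows equation (2) as read here (position of the (k - m + 1)-th miss counted from the right); the second is a brute-force cross-check that appends 1s until the constraint holds again. Both function names and the string representation of the k-sequence are illustrative assumptions.

```python
def exit_distance(k_seq: str, m: int) -> int:
    """Distance to exit a failure state: k - l(k - m + 1, s) + 1.

    l is the 1-based position from the right of the (k - m + 1)-th miss.
    Returns 0 when the stream is not in a failure state.
    """
    k = len(k_seq)
    needed = k - m + 1
    misses = 0
    for pos, bit in enumerate(reversed(k_seq), start=1):
        if bit == "0":
            misses += 1
            if misses == needed:
                return k - pos + 1
    return 0


def exit_distance_brute_force(k_seq: str, m: int) -> int:
    """Cross-check: append 1s on the right until at least m bits are 1."""
    steps = 0
    while k_seq.count("1") < m:
        k_seq = k_seq[1:] + "1"
        steps += 1
    return steps
```

Both functions agree on the values of Example 2 and on the (2,5)-firm sequences 00001 and 10000 discussed earlier in this section.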

In a successful state, the priority is assigned according to DBP, while in a failure
state the priority is assigned by equation (2). In case of equal priority, EDF is
adopted. The resulting scheme is referred to as E_DBP. As discussed above, the
distance to exit a failure state is the symmetrical notion of the distance to a
failure state, as if seen in a mirror, so the negative logic is applied. In DBP,
the stream with the smaller distance gets the higher priority, because a further
miss, which adds a 0 to the right of its k-sequence, would bring it closer to a
failure state. In E_DBP, a packet serviced by its deadline adds a 1 to the right
of the k-sequence, so a stream with a smaller Φ value exits the failure state more
easily. It is therefore reasonable to assign the higher priority to the packet from
the stream with the smaller Φ value. The detailed priority assignment process is
as follows.
E_DBP precedence among all packets eligible for selection:

— If all streams are in successful states, the smaller the DBP value, the higher
the priority. If the DBP values are the same, EDF is adopted.

— If only one stream τ_j is in a failure state and the others are in successful
states, the packet in stream τ_j gets the higher priority.

— If several streams are in failure states at the same time, the smaller the Φ
value, the higher the priority. If the Φ values are the same, EDF is adopted.

— In all cases, if the deadlines are the same, FIFO is used.
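
The precedence rules above can be sketched as a sort key in Python. This is only an illustration under stated assumptions: the `HeadPacket` record and its field names are hypothetical, and the sketch assumes that any stream in a failure state precedes every stream in a successful state, which generalizes the single-stream rule in the list.

```python
from dataclasses import dataclass


@dataclass
class HeadPacket:
    stream_id: int
    k_seq: str       # stream history, oldest packet on the left
    m: int           # m of the (m,k)-firm constraint; k = len(k_seq)
    deadline: float
    arrival: float


def _pos_from_right(k_seq, n, symbol):
    """1-based position from the right of the n-th symbol, or k + 1."""
    seen = 0
    for pos, bit in enumerate(reversed(k_seq), start=1):
        if bit == symbol:
            seen += 1
            if seen == n:
                return pos
    return len(k_seq) + 1


def precedence_key(p):
    """Smaller key means higher priority."""
    k = len(p.k_seq)
    in_failure = p.k_seq.count("1") < p.m
    if in_failure:
        # distance to exit the failure state
        distance = k - _pos_from_right(p.k_seq, k - p.m + 1, "0") + 1
    else:
        # DBP distance to fall into a failure state
        distance = k - _pos_from_right(p.k_seq, p.m, "1") + 1
    # failure state first, then distance, then EDF, then FIFO
    return (0 if in_failure else 1, distance, p.deadline, p.arrival)


def select_next(heads):
    """Pick the head-of-queue packet to serve next."""
    return min(heads, key=precedence_key)
```

Encoding the rules as a tuple keeps the tie-breaking order (distance, then EDF, then FIFO) explicit and easy to change.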

4 Simulation Results

The newly proposed algorithm, E_DBP, is compared with DBP through the simulation
examples given in [4]. QoS in terms of probability of dynamic failure and delay is
taken as the performance metric. In the results that follow, two packet generation
patterns are considered: Poisson and burst. In a Poisson stream, packet
inter-arrival times are exponentially distributed. A burst source alternates between
ON and OFF states. When in the ON state, packets are generated periodically.
No packets are generated when the source is in the OFF state. The durations of
the ON and OFF states are exponentially distributed with averages ONave
and OFFave, respectively. Such a stream is often used to model a stream of voice
samples in a conversation [8]. We first consider the case where all streams in the
system have the same timing requirements. We also assume that only the packets
that can meet their deadlines are serviced, which means that the drop policy is
enabled.
The simulations were run with OPNET Modeler 8.0.C. The time duration of all
simulation projects is 20000. We define $\sum_{i=1}^{n} C_i / E(T_i)$ as the average
system load and $\sum_{i=1}^{n} m_i C_i / (k_i E(T_i))$ as the average system
(m,k)-load, where $E(T_i)$ is the mean inter-arrival time. The initial k-sequence
is 11...1 ($k_i$ ones) for a stream with window $k_i$.
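
The two load measures can be illustrated with a short Python sketch. Since the formulas are damaged in the source, the sketch assumes the commonly used definitions: the average load is the sum of C_i / E(T_i) over all streams, and the (m,k)-load weights each term by m_i / k_i. The tuple layout and function names are hypothetical.

```python
def average_load(streams):
    """Sum of C_i / E(T_i); each stream is (C, mean_T, m, k)."""
    return sum(c / mean_t for (c, mean_t, m, k) in streams)


def mk_load(streams):
    """Sum of (m_i / k_i) * C_i / E(T_i); an assumed definition of (m,k)-load."""
    return sum((m / k) * c / mean_t for (c, mean_t, m, k) in streams)


# Five identical (3,4)-firm streams, unit service time, mean inter-arrival 2.5
example = [(1.0, 2.5, 3, 4)] * 5
```

For the five example streams, the average load is 2.0 and the (m,k)-load is 1.5 under these assumed definitions.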



4.1 Evaluation of Dynamic Performances 

Poisson Streams. The left column of Table 1 shows the probability
of dynamic failure in a system with (3,4)-firm constraints. The system consists
of five streams. All packets require a constant service time. Service deadlines are
set equal to five times the packet service time. The packet inter-arrival time is
exponentially distributed and the overall average load is varied from 1.0 to 2.0 by
changing the inter-arrival time. The results show that the new scheme E_DBP
reduces the probability of dynamic failure, especially under overload. The
maximum reduction in this case is 9.3%, at a load of 2.0.

In the above system, all streams have the same deadline requirement with a
(3,4)-firm constraint. The middle column of Table 1 shows the results for a
heterogeneous system in which the streams have different requirements. The
system consists of five streams with (9,10)-firm, (3,4)-firm, (1,2)-firm,
(1,3)-firm and (1,4)-firm constraints, respectively. The packet service time,
arrival pattern, and deadline in this system are the same as for the streams
examined in the first example. The arrival rates of the packets are adjusted to
obtain an average system load from 1.3 to 2.3. The simulation results show that at
loads 1.4 to 1.6, E_DBP is slightly worse than DBP (a 1.0% to 2.3% increase in the
probability of dynamic failure). Nevertheless, there is still a strong trend for
E_DBP to reduce the probability of dynamic failure, by about 8.6% at load 2.3.



Burst Streams. The right column of Table 1 shows the probability of
dynamic failure in a system with five burst streams. The ON and OFF periods
of each stream are exponentially distributed with ONave = 50 and OFFave = 100.
The offered peak load of a stream is therefore three times its average load.
When in the ON state, a stream generates packets periodically with a period of 5.
The deadlines are set to twice the generation period. The overall load is varied
by changing the packet service time. We find that E_DBP is better than DBP at
guaranteeing the




Extended DBF for (m,k)-Firm Based QoS 363 



Table 1. Probability of dynamic failure for Poisson streams with the same
constraint, the heterogeneous system, and burst streams

        Poisson streams         |     Heterogeneous system      |         Burst streams
Avg Load   DBP    E_DBP   % Rd  | Avg Load   DBP    E_DBP   % Rd | Avg Load   DBP    E_DBP   % Rd
  1.0     0.055   0.055    -    |   1.3     0.032   0.032    -   |   0.5     0.000   0.000    -
  1.1     0.096   0.095   0.8   |   1.4     0.053   0.054  -2.3  |   0.6     0.006   0.006   3.3
  1.2     0.156   0.154   1.3   |   1.5     0.083   0.084  -1.6  |   0.7     0.031   0.031    -
  1.3     0.229   0.223   2.4   |   1.6     0.115   0.117  -1.0  |   0.8     0.078   0.075   4.0
  1.4     0.311   0.299   3.9   |   1.7     0.157   0.157   0.3  |   0.9     0.150   0.145   3.3
  1.5     0.398   0.378   5.0   |   1.8     0.200   0.198   0.8  |   1.0     0.227   0.213   6.0
  1.6     0.481   0.449   6.5   |   1.9     0.245   0.240   1.8  |   1.1     0.296   0.270   8.7
  1.7     0.557   0.514   7.8   |   2.0     0.293   0.284   3.1  |   1.2     0.371   0.337   9.1
  1.8     0.623   0.569   8.7   |   2.1     0.339   0.323   4.8  |   1.3     0.449   0.409   8.8
  1.9     0.675   0.612   9.2   |   2.2     0.381   0.362   6.6  |   1.4     0.509   0.472   7.4
  2.0     0.716   0.649   9.3   |   2.3     0.431   0.394   8.6  |   1.5     0.556   0.509   8.4

(m,k)-firm constraint and substantially reduces the probability of dynamic failure
as the average load varies from 0.5 to 1.5. The maximum reduction, 9.1%, occurs
at load 1.2. The probability of dynamic failure is clearly higher in the burst case
than in the Poisson case at the same average load, because the
peak load is heavier and the packets are more concentrated in the burst case.

4.2 Delay Analysis 

Delay is the time interval between the departure of a packet from the source and
its arrival at the destination. This is usually referred to as end-to-end delay.
In the MIQSS model, delay means only the queueing delay. Delay is an important
QoS parameter. Many real-time applications such as voice over IP (VoIP),
video-conferencing, and tele-medicine require guarantees on delay and packet loss.
These applications are usually sensitive to delay and tolerant of some loss. A
smaller delay makes the media stream smoother. The delay statistic reported here
represents instantaneous measurements of packet waiting times in the server queue;
the delay of discarded packets is not included. The simulation results for the
above three examples show that E_DBP reduces the delay compared with DBP to
different degrees as the (m,k)-load is varied. In some load ranges the queue delay
is decreased more effectively, whereas at light loads the reduction is not so
significant. The third example is the burst stream, where the packets are more
concentrated in the ON state, so even at a lighter (m,k)-load we still get results
similar to those of the Poisson stream cases. The delay in the burst case
fluctuates sharply because of the concentration of packets.
Fig. 1. Queue delay comparison of E_DBP and DBP for Poisson streams with
(3,4)-firm constraints at (m,k)-load equal to 1 and 1.5 (two panels; horizontal
axis: time)

Fig. 2. Queue delay comparison of E_DBP and DBP for the heterogeneous system
at (m,k)-load equal to 1 and 1.5 (two panels; horizontal axis: time)

Fig. 3. Queue delay comparison of E_DBP and DBP for burst streams with (3,4)-firm
constraints at (m,k)-load equal to 0.5 and 1
5 Conclusions

The main original contributions of this paper are: 

— Pointing out the drawbacks of classic DBP when it is applied in a more
general real-time context, together with corresponding possible solutions.

— Proposing E_DBP to improve DBP by taking the distribution of 0s and 1s in the
k-sequence into account when a stream is in a failure state, and giving the
equation to calculate the distance to exit a failure state.

— Showing that E_DBP achieves a lower probability of dynamic failure and a smaller
queue delay than classic DBP, which is validated in various cases through
simulation based on OPNET 8.0.C.

This improvement comes at a very low computing cost: it only requires determining
the minimum number of 1s that need to be added on the right of the k-sequence.
Furthermore, this additional computation is only needed when the stream is in a
failure state. In this sense, our algorithm is of interest for guaranteeing
(m,k)-firm QoS in network scheduling. Combining WFQ and RED with the (m,k)-firm
constraint may also be interesting future work.
Abstract. High-speed networks usually have to deal with two main issues.
The first is fast switching to obtain good throughput. At present, state-of-the-art
switches employ an input queued architecture to obtain high
throughput. The second is providing QoS guarantees for a wide range of
applications. This is generally considered in output queued switches. To meet
these two requirements, many scheduling mechanisms have been proposed
to support both better throughput and QoS guarantees in high-speed
switching networks. In this paper, we present a scheduling algorithm
for providing QoS guarantees and higher throughput in an input queued
switch. The proposed algorithm, called Weighted Fair Matching (WFM),
provides QoS guarantees without output queues, i.e., WFM is a
flow-based algorithm that achieves asymptotically 100% throughput with
no speedup while providing QoS.

Keywords: scheduling algorithm, input queued switch, QoS 



1 Introduction 

The input queued switch overcomes the scalability problem that occurs in
output queued switches. However, it is well known that an input queued switch
with a FIFO queue at each input port suffers from the Head-of-Line (HOL) blocking
problem, which limits the throughput to 58% [1].

Many algorithms have been suggested to improve the throughput. In order
to overcome the performance reduction due to HOL blocking, most proposed
input queued switches have separate queues, called Virtual Output Queues (VOQs),
for the different output ports at each input port. With VOQs, input queued
switches need a matching algorithm to form input-output port pairs.

Parallel Iterative Matching (PIM), which is one of the Maximum Size Matching
schemes, is a three-phase scheduling algorithm that uses parallelism, randomness
and iteration to achieve higher throughput [2]. Some variations of PIM, such
as iSLIP [3], have appeared. iSLIP is very efficient and its throughput can reach
100%, but it does not address QoS problems.
Another proposed algorithm is RPA, which realizes a Maximum Weighted
Matching (MWM) scheme [4] and is based on reservation rounds in which the
switch input ports indicate their most urgent data transfer needs. RPA takes
an approach similar to that of the method proposed in this paper, but with a
different scheduling algorithm.

There has been a large amount of work on providing service guarantees
in integrated services networks. Various scheduling algorithms have been proposed
to provide QoS guarantees. Generalized Processor Sharing (GPS) is considered
an ideal scheduling discipline [5]. GPS is based on a fluid model in which the
packets are assumed to be infinitely divisible and multiple sessions may transmit
traffic through the outgoing link simultaneously. Weighted Fair Queuing (WFQ)
is a packetized version of generalized processor sharing [6]. Some variations of
WFQ, such as Self-Clocked Fair Queuing (SCFQ), Virtual Clock (VC), and Deficit
Round Robin (DRR), have appeared in the literature to address the computational
complexity of WFQ.

Most work on QoS provisioning has been done in the context of
output queued switches, where the speed of the switching fabric and output buffer
memory is required to be N times the input line speed. As line speeds increase and
as routers have more input ports, the required fabric speed becomes infeasible and
non-scalable. For these reasons, in addition to the demand for high throughput
on routers or switches with an input queued architecture, there is an increasing
need to support applications with diverse performance requirements for which
QoS is guaranteed.

However, there is a restriction on providing QoS guarantees in an input
queued switch: an input queued switch is scalable, but some packets may not
be transmitted promptly across the switch fabric because enqueued packets cannot
be isolated, which may lead to QoS violations. Therefore, the goal of providing
QoS guarantees in an input queued switch is to design a scheduling algorithm
that meets the QoS requirements while queued packets are transmitted
across the switch fabric promptly (i.e., throughput maximization).

In this paper, we propose a scheduling algorithm for providing QoS guarantees
and high throughput in an input queued switch. The proposed algorithm,
called Weighted Fair Matching (WFM), is a flow-based algorithm that
provides bandwidth allocation. Like other matching algorithms, it can achieve
asymptotically 100% throughput under uniform traffic.

WFM is unique among schedulers for input queued switches in the sense that the
selection right and the corresponding matching mechanism, based on the virtual
finishing time of WFQ, are exercised at the output ports, using the number of
connections to each output port and the virtual finishing time-stamps already
computed and transferred by the input ports.

This paper is organized as follows. Section 2 gives the basic principle of the
proposed scheduling method. Section 3 shows the performance based on simulation.
The conclusion is drawn in Section 4.
2 Weighted Fair Matching Algorithm

We now propose an algorithm, WFM, which applies Weighted Fair Queueing
(WFQ) [6] to the input queued switch. This algorithm operates as a scheduler that
avoids HOL blocking and provides QoS guarantees simultaneously. Like other
scheduling algorithms for input queued switches, WFM uses multiple virtual queues
at each input port, one for each output port. In this section, we first describe
how to apply WFQ in the input queued switch and then present WFM.



2.1 Applying Weighted Fair Queueing 



GPS is an ideal scheme for fluid traffic, which is assumed to be infinitely
divisible, so that multiple connections may transmit traffic through the output
port simultaneously at different rates. Its packetized version, the WFQ scheduling
algorithm, can be thought of as a way to emulate the hypothetical GPS discipline by
a practical packet-by-packet transmission scheme.

For an N × N output queued switch, the bandwidth of each output port is
shared by N flows. In this case, each output port has a WFQ system composed of
a WFQ server and N queues for the N flows. In every slot, each output
port's WFQ server selects one of its own queues.

Figure 1 depicts the overall block diagram of WFM in a 2 × 2 input queued
switch. We denote the k-th input and output ports by I_k and O_k, respectively.
Let F(i,j) be the flow that is switched from I_i to O_j.

Applying WFQ to the input queued switch is not much different from the
case of the output queued switch. In the input queued switch, the flow F(i,j) is
backlogged in Q_i^j, which denotes the virtual output queue for O_j in I_i. As in
output queued switches, there are N WFQ systems. Let S_j be the WFQ server in O_j.
This server includes a virtual time for tracking the normalized fair service amount.
As shown in Fig. 1, the WFQ system for O_j is composed of S_j and the N VOQs located
in the input ports and destined for O_j.

In the input ports, all arriving packets are tagged with virtual finishing times
computed according to WFQ based on the allocated bandwidth. The time-stamp
$TS^k_{i,j}$ associated with the $k$-th packet of $F(i,j)$ is calculated as follows:

$$TS^k_{i,j} = \max\{v_j(t),\, TS^{k-1}_{i,j}\} + \frac{L^k_{i,j}}{\phi_{i,j}} \qquad (1)$$

where $L^k_{i,j}$ denotes the packet length, $\phi_{i,j}$ the bandwidth allocated to
$F(i,j)$, and $v_j(t)$ the virtual time of $S_j$.
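
The time-stamp computation can be sketched in Python using the standard WFQ virtual finishing time rule: the later of the current virtual time and the previous time-stamp of the flow, plus the packet length divided by the allocated share. The function and parameter names are hypothetical, and the division by the flow's weight is an assumption based on the usual WFQ formulation.

```python
def next_timestamp(virtual_time, prev_timestamp, packet_length, weight):
    """Virtual finishing time of the next packet of a flow (WFQ-style).

    virtual_time:   current virtual time v_j(t) of the output's WFQ server
    prev_timestamp: time-stamp of the previous packet of the same flow
    packet_length:  length of the arriving packet
    weight:         bandwidth share allocated to the flow (assumed > 0)
    """
    return max(virtual_time, prev_timestamp) + packet_length / weight
```

A flow with a larger weight accumulates time-stamps more slowly, so its packets are selected more often by a scheduler that serves the smallest time-stamp first.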



2.2 Description of WFM 

In the previous subsection, we described how WFQ is used to share each output
port in an input queued switch. For input queued switches, the main problem is how
to match input and output ports to obtain high throughput. In [14], WFQ is used to
perform input-output port matching with simple sequential scheduling. However,
that approach was not shown to provide QoS in an input queued switch.
Fig. 1. Weighted fair queueing in a 2 × 2 input queued switch



We propose a scheduling scheme that not only matches input and output ports
but also provides QoS. As shown in Figure 1, the switch model
for WFM has a non-buffered crossbar, and all its output ports are connected to a
shared medium, called a bus, whose main role is to broadcast information on
input-output port matching to all output ports concerned. Three steps are used to
resolve conflicts among input ports, as follows:

Step 1: Request. Each input port sends a request to every output port
for which it has a queued cell. Each request corresponds to a nonempty VOQ
and includes the time-stamp (i.e., virtual finishing time) TS_{i,j} of the cell at
the head of the VOQ. All received requests, along with their time-stamps, are
stored in the request array of each output. The request array of O_j is denoted by
R_j and consists of N elements, denoted by R_j[i] with 1 ≤ i ≤ N, each of which
contains the corresponding virtual finishing time. In addition, if an output port
receives one or more requests, it counts the number of connections destined for
itself, which is denoted by C_j.

Step 2: Sort and Output Port Selection. The output ports are sorted
in increasing order of the number of received requests (i.e., C_j). This
determines the order in which the output ports are granted the right to select an
input port. The reason for sorting in ascending order is that the output port with
the smallest number of backlogged flows has fewer input ports to match, so it is
allowed to make its selection earlier than the others, which gives a higher
probability of forming matching pairs. For example, if two output ports O_m and
O_n satisfy C_n(t) > C_m(t), the input-output pair for O_m is determined earlier
than that for O_n.

Step 3: Input Port Selection and Matching. Once granted the right to choose
an input port, the output port picks the input port with the smallest virtual
finishing time, as expressed by the time-stamps. At most one input port among the
unmatched input ports is chosen. When an input port is selected, the information
about the "matched ports" is broadcast over the common bus to all
output ports so that the same input cannot be selected by another



370 



S.-H. Lee, D.-R. Shin, and H.Y. Youn 



output. This process is repeated sequentially at the output ports until all
matchings are done.

In short, the selection priority is granted to the output port with the smallest
number of connections C_j (Step 2), while the matching by the selected output
port is based on the time-stamps of the connected input ports (Step 3). This
differs from the approach taken in RPA, where matching is done by the input
ports.
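
The three steps can be sketched as a single matching pass in Python. This is a simplified illustration rather than the authors' implementation: the matrix representation of head-of-line time-stamps, the function name `wfm_match`, and the tie-breaking by port index are all assumptions.

```python
def wfm_match(timestamps):
    """One WFM-style matching pass.

    timestamps[i][j] is the head-of-line time-stamp of the VOQ from input i
    to output j, or None if that VOQ is empty.
    Returns a dict {output j: matched input i}.
    """
    n_in = len(timestamps)
    n_out = len(timestamps[0]) if timestamps else 0

    # Step 1: each output collects the requests (time-stamps) sent to it
    requests = {
        j: {i: timestamps[i][j] for i in range(n_in) if timestamps[i][j] is not None}
        for j in range(n_out)
    }

    # Step 2: outputs with fewer requests select first (ties: lower index)
    order = sorted((j for j in requests if requests[j]),
                   key=lambda j: (len(requests[j]), j))

    # Step 3: each output picks the unmatched input with the smallest time-stamp
    matched_inputs = set()
    matching = {}
    for j in order:
        candidates = {i: ts for i, ts in requests[j].items()
                      if i not in matched_inputs}
        if not candidates:
            continue
        i = min(candidates, key=lambda idx: (candidates[idx], idx))
        matching[j] = i
        matched_inputs.add(i)  # plays the role of the bus broadcast
    return matching
```

In the 3 × 3 example `[[5.0, 2.0, None], [3.0, None, None], [4.0, 1.0, 6.0]]`, output 2 has a single request and selects first, which leaves inputs 0 and 1 available for the two busier outputs and yields a complete matching.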



3 Simulation 

We perform simulations to illustrate the fast switching and QoS provisioning
capabilities of the proposed method. The simulation experiments evaluate
switching performance, delay control capability, bandwidth allocation capability,
and fairness. Each subsection describes the corresponding results.



3.1 Switching Performance 

Simulation is performed on a 16 × 16 switch, where each input has 16 flows, one for 
each output port, for a total of 256 flows. All flows reserve the same bandwidth, i.e., 
they have equal weights. To evaluate switching capability, we measured the average delay and 
compared it with that of the iSLIP and RPA algorithms. In this simulation, iSLIP runs 
with 4 iterations, because the minimum number of iterations required by iSLIP is $\log_2 N$ 
[3]. Concerning the input traffic, we consider two types of models. 

1. Uniform traffic: cells arrive according to a Bernoulli arrival process, and the 
output port of each cell is selected independently at random 

2. Bursty traffic: cells arrive according to an on-off arrival process modulated by a 
two-state Markov chain, with destinations uniformly distributed over all output 
ports 
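
The two traffic models can be sketched as follows for a single input port. This is a minimal illustration under stated assumptions, not the simulator used in the paper: the destination of a burst is drawn once per burst, on and off periods are geometrically distributed, and the function names and parameters are hypothetical. The bursty generator requires 0 < load < 1.

```python
import random


def uniform_traffic(load, n_ports, n_slots, rng):
    """Bernoulli arrivals: one cell per slot with probability `load`,
    destined to an output port chosen uniformly at random."""
    cells = []
    for _ in range(n_slots):
        if rng.random() < load:
            cells.append(rng.randrange(n_ports))
        else:
            cells.append(None)  # idle slot
    return cells


def bursty_traffic(load, n_ports, n_slots, mean_burst, rng):
    """On-off arrivals driven by a two-state Markov chain.

    In the ON state one cell arrives per slot; the mean ON length is
    `mean_burst` slots. The OFF->ON probability is chosen so that the
    long-run fraction of busy slots equals `load`.
    """
    p_end = 1.0 / mean_burst
    p_start = min(1.0, load / (mean_burst * (1.0 - load)))
    on, dest, cells = False, None, []
    for _ in range(n_slots):
        if on:
            cells.append(dest)
            if rng.random() < p_end:
                on = False
        else:
            cells.append(None)
            if rng.random() < p_start:
                # Assumption: the destination is fixed for the whole burst.
                on, dest = True, rng.randrange(n_ports)
    return cells


rng = random.Random(1)
u = uniform_traffic(0.5, 16, 200_000, rng)
b = bursty_traffic(0.5, 16, 200_000, 32, rng)
u_load = sum(c is not None for c in u) / len(u)
b_load = sum(c is not None for c in b) / len(b)
```

Both generators produce the same offered load on average; they differ only in how the busy slots are clustered in time.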



[Plot legend: WFM Uniform, iSLIP(4) Uniform, RPA Uniform, WFM Bursty, RPA Bursty] 

Fig. 2. Average delay under uniform traffic 

Fig. 2 shows the average cell delay, normalized with respect to the 
slot time. For uniform traffic, WFM improves on iSLIP in 
average delay. At light loads below 60%, all three algorithms have similar delay, while at 
high loads above 60%, the delay with WFM is less than half that of iSLIP and 
RPA. For loads below 90%, RPA has the longest delay. As a result, 
WFM provides better delay performance than iSLIP and RPA, which is due 
to the transient delay at the start and to the priority assignment. Since iSLIP is capable of 
achieving 100% throughput for uniform traffic, this indicates that WFM also achieves 
100% throughput for uniform traffic. 

For bursty traffic, WFM also has the best performance. In this work, the 
burst length is 32. 

3.2 Delay Control Capability 

Fig. 3 shows the delay control capability. The simulation is performed under 
the same conditions as in Section 3.1, except that the flows to each output port have 
different weights. Each output port has 16 flows, whose weights are set 
to 1 through 16. We took samples for flows F(1, 1), F(4, 1), F(8, 1), F(12, 1), and F(16, 1) 
at $O_1$. 

The delay control ability of WFM is compared with that of an output queued switch. 
The result for the output queued switch is shown in Fig. 4. At light loads below 
60%, each flow's delay is almost identical to that in the output queued switch, while 
at high loads above 60%, the delays of all flows are larger than those in the output 
queued switch. Nevertheless, WFM can still control each flow's delay. 




[Plot: horizontal axis is Load] 

Fig. 3. Delay per flow using WFM 



3.3 Throughput with Weighted Fair Bandwidth Allocation 

[Plot: horizontal axis is Load] 

Fig. 4. Delay per flow using WFQ in output queued switch 

In this subsection, we demonstrate WFM's ability to allocate bandwidth among 
input ports in proportion to their reservations. The simulation is performed on 
an 8 × 8 switch where each input port has one flow destined to $O_1$, for a total of 
8 flows. The flows are assigned weights 1 to 8. As shown in Fig. 5, the bandwidth 
is distributed in proportion to each flow's weight under uniform traffic, so WFM 
can allocate the switch bandwidth in proportion to the weights. 
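
As a worked illustration of proportional allocation, the ideal share of each of the eight flows can be computed directly from its weight. This sketch assumes that all eight flows stay backlogged and that the output link is fully used; the function name is hypothetical and the numbers are ideal targets, not the measured throughputs of Fig. 5.

```python
from fractions import Fraction


def weighted_shares(weights):
    """Ideal fraction of the output bandwidth for each backlogged flow."""
    total = sum(weights)
    return [Fraction(w, total) for w in weights]


# Weights 1..8 as in the 8 x 8 experiment; they sum to 36.
shares = weighted_shares(range(1, 9))
for weight, share in zip(range(1, 9), shares):
    print(f"weight {weight}: {share} of the link ({float(share):.3f})")
```

Under these assumptions the flow with weight 1 receives 1/36 of the link and the flow with weight 8 receives 8/36 = 2/9.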




[Plot: horizontal axis is Load] 

Fig. 5. Throughput per flow under uniform traffic 



Fig. 6 presents the result under bursty traffic with burst length 32. The 
bandwidth of each flow is again allocated in proportion to its weight. 



4 Conclusion 

In this paper we proposed a scheme, called weighted fair matching (WFM), for 
providing QoS in an input queued switch. We described how to apply the weighted 
fair queueing of the output queued switch to the input queued switch and 
proposed a simple matching method. WFM is a flow-based fair scheduling 
algorithm that operates sequentially. Its main features are that it provides good 
throughput and allocates the output bandwidth in a simple manner. We showed that 
the proposed scheme achieves 100% throughput with low latency and provides 
QoS guarantees. 

Fig. 6. Throughput per flow under bursty traffic 

Abstract. As Internet users increasingly require various multimedia 
communication services, several signaling protocols have been 
proposed for the efficient control of complex multimedia communication 
services. However, the model and architecture of multi-party conferencing 
currently being standardized by the IETF has limited 
scalability and cannot meet the requirements for managing a large-scale 
multimedia conferencing service. In this article, we present a new 
scalable distributed architecture, based on SIP, for the efficient management of 
large-scale multimedia conferencing services. High 
scalability is achieved by adding, deleting, and modifying multiple 
mixers and by composing a conference server network in a distributed way, in 
real time, and without disruption of services. The SIP-based control 
mechanism for achieving this scalability has been designed in detail. Finally, 
the performance of the proposed architecture has been evaluated 
by simulation. 



1 Introduction 

Internet telephony services provide not only traditional voice services but also 
packet-based application services, so they are applied in various multimedia 
communication services such as video and multi-party conferencing. Demand 
for Internet telephony services is steadily increasing. Additionally, signaling 
protocols, applications, and models have been developed for the efficient control of 
complex multimedia communication services. 

Internet telephony services use standard protocols such as H.323, SIP, 
and MGCP for call control and signaling. ITU-T's H.323 defines the terminals 
and other components that provide multimedia communication on a packet 
network [1]. SIP (Session Initiation Protocol) is an application-layer signaling 
protocol standardized by the IETF that defines the initiation, modification, and 
termination of multimedia communication sessions between users [2]. Because 
SIP offers flexibility, user mobility, and various other merits, its application area is 
wide. 

Currently, a great deal of research on conferencing models and control 
mechanisms using SIP is being conducted to provide complex multi-party 
conferencing. The IETF MMUSIC working group studies the general requirements of 
an Internet multimedia conference structure for supporting multi-party 
conferencing [3]. The IETF SIPPING working group has proposed several drafts for multi-party 
conferencing based on SIP [4]. Stream processing, such as mixing and encoding of 
various types of media, and its performance evaluation have been studied for a conference 
based on the centralized server model [8]. Based on how signaling 
and the redistribution of streams are handled, conferencing models can generally be divided 
into end system mixing, multicast, centralized server, and full mesh; the 
characteristics of each model are shown in Table 1. 



Table 1. Characteristics of conferencing models 

Model               Signaling  Media      Inviting              Joining               Scalability 
End System Mixing   Tree       Tree       INVITE                INVITE                Small 
Multicast           Pairs      Multicast  INVITE                Multicast Join        Large 
Centralized Server  Star       Star       REFER                 INVITE                Medium 
Full Mesh           Star       Full Mesh  REFER + Server msg.   INVITE + Server msg.  Medium 



Although the multicast model is highly scalable, it is hard to apply because 
multicast is not widely deployed. Currently, the centralized server model 
is adopted as the basic multi-party conferencing model. However, this model has 
scalability limitations, such as triangular transmission caused by using a single 
conference server, a bottleneck due to traffic concentration, and processing overload. Thus, 
a new scalable conferencing model supporting large-scale multi-party conferences 
is needed. More specifically, the new model should be designed so that it 
facilitates both the transition from, and the integration with, the existing 
conferencing model, using standard signaling methods. In this paper, we suggest 
a new scalable distributed architecture for multi-party conferencing using SIP 
that can reduce traffic and processing load, and that can support scalability by 
constructing a special network for conferencing servers. 

In this architecture, a participant host acquires information about its adjacent 
conferencing server (ACS), and several conferencing servers can join a conference using 
this information. Streams can be delivered efficiently in this architecture. A 
distributed conference can be constructed from the existing conferencing model by 
using the standard signaling procedure, and the load can be distributed by 
constructing a conference server network without changing the end host. Using the adjacent 
conference server information and a data distribution mechanism, the architecture can provide 
a virtual multicast conference. 

In this paper, we analyze the features of the existing multi-party conferenc- 
ing models using SIP, and propose a new distributed architecture for multi-party 
conferencing which can support scalability, load distribution, and traffic distribu- 
tion. Specifically, we describe a signaling procedure and conferencing mechanism 
that can make this architecture more scalable. 

2 Distributed Conferencing Architecture 

In a centralized conferencing model, several problems arise because 
the conference is controlled by a single server. First, regardless of the server 
location, even when the conference server is far from the participants, data 
transmission between participants always goes through the conference 
server. Second, a bottleneck may occur because the traffic of all participants is 
concentrated at the server. Third, the processing load of the server can increase rapidly 
because the server must mix and encode all the streams. A centralized conferencing 
model therefore has limited scalability in a large-scale conference environment, 
and a new multi-party conferencing model that can deliver streams effectively 
and provide scalability is needed. We thus propose a new distributed conferencing 
model that provides scalability. 

We propose a distributed conferencing architecture for large-scale multimedia 
conferencing services. Fig. 1 shows the distributed conferencing 
architecture, which consists vertically of three tiers: a conference management 
tier, a mixer tier for multimedia stream processing, and the participants. The 
salient feature of the architecture is that the conference management tier is 
configured in a distributed way. 




Fig. 1. Distributed Conferencing Architecture 



In the distributed conferencing architecture, a conference consists of several 
local conference servers (CSs). Each local conference server contains a focus, 
which is responsible for managing the corresponding local conferencing 
service. The focus also manages the corresponding mixers in the region for 
load sharing and media streaming. One of the conference 
servers is designated as the primary conference server (PCS). Horizontally, the conference 
consists of the PCS and several regional CSs, which handle the signaling and streaming 
of the conference. Both the PCS and the regional CSs control the conferencing 
operations using SIP signaling, with some extensions, in accordance with the 
conference policy. A CS also handles the mixing and redistribution of multimedia 
streams such as conference video and audio streams. The set of CSs involved in 
the conference constitutes a network called the conference server network (CSN). 

The PCS is responsible for controlling the whole conference in an integrated 
way. It sets up and modifies the CSN. It also controls access 
to the conference servers: participants must first obtain permission 
from the PCS to participate in a specific conference session. The PCS announces 
the conference session information using the Session Announcement Protocol (SAP) 
[12] and handles participation requests. The PCS can add and delete mixers 
according to the scale of the conference, so that it can compose the CSN properly. 
Thus, the control and mixing operations are distributed in the proposed 
architecture, which reduces processing overload and traffic 
concentration. These features can greatly enhance the scalability 
of the conferencing system. 

Basically, one of the CSs in a conference is selected automatically to play the 
role of PCS. If the PCS leaves the CSN, a PCS transition procedure takes place so 
that another CS in the CSN becomes the PCS. In this way, the conference can be 
maintained without keeping unnecessary CSs. Since the CSN is configured 
independently of the participants, participants do not have to be aware of the composition 
of the CSN. This allows the signaling and stream procedures of the centralized 
conferencing model to be used without modification. Additionally, the triangular 
transmission caused by accessing a remote server can be eliminated, 
and delay and traffic in the core network can be reduced accordingly. 

In order to efficiently support the conferencing operation of the proposed 
distributed architecture, we have extended SIP signaling for the 
exchange of ACS information between the primary focus and the participants. 
This is achieved by using an ACSInfo header, a newly defined 
SIP extension header for the proposed distributed conferencing architecture. 
The primary focus uses the ACSInfo header information to configure 
the conference mixer network. The format of the ACSInfo header defined 
in the distributed conferencing model is shown below. 

ACSInfo: SIP-URI or hostport 

However, even if an end host does not send ACS information, a CS can 
handle the conference composition and signaling process. That is, even if the end host is 
a "Conference Unaware UA", it can participate in a conference. 
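
The handling of the ACSInfo header can be sketched with plain string processing. This is only an illustration of the header format given above, not a SIP stack: the message contains just a few headers (a real INVITE also needs Via, Call-ID, CSeq, and others), the host names under example.com are placeholders, and the helper names `build_invite` and `acs_info` are hypothetical.

```python
def build_invite(target_uri, from_uri, acs=None):
    """Build a heavily simplified INVITE; `acs` is a SIP-URI or hostport."""
    lines = [
        f"INVITE {target_uri} SIP/2.0",
        f"From: <{from_uri}>",
        f"To: <{target_uri}>",
    ]
    if acs is not None:
        lines.append(f"ACSInfo: {acs}")  # extension header from the text
    return "\r\n".join(lines) + "\r\n\r\n"


def acs_info(message):
    """Return the ACSInfo value, or None for a conference-unaware UA."""
    header_part = message.split("\r\n\r\n", 1)[0]
    for line in header_part.split("\r\n")[1:]:  # skip the request line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "acsinfo":
            return value.strip()
    return None


aware = build_invite("sip:conf1@pcs.example.com", "sip:a@example.com",
                     acs="sip:cs2.example.com")
unaware = build_invite("sip:conf1@pcs.example.com", "sip:b@example.com")
```

A focus receiving `aware` would learn that the sender's adjacent server is `sip:cs2.example.com`, while for `unaware` it obtains no ACS information and has to place the participant by its own policy.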

A NOTIFY message is used to exchange information about the participants 
when the conference status changes. When participants join a conference, they 
subscribe to conference notifications and receive 
notification messages that contain the current conference status, such as 
participant status, conference server status, PCS transition notification, and CS table 
exchange. Using the participant status notification message, the PCS broadcasts 
status changes to all participants when a participant joins 
or leaves. The CS status message includes information on the configuration 
of the CSN and the stream transmission mechanism. The CS table exchange message 
includes the participant list and the stream type of the mixer. 

3 Signaling of Distributed Conferencing Architecture 

Distributed conferencing uses SIP signaling to construct a conference and its 
CSN. A conference server operates like a general SIP UA, and a user can join 
a conference by submitting a connection request to the CS. Fig. 2 shows 
a test network designed to examine the conference composition procedures 
of distributed conferencing. The procedures for inviting, joining, and leaving, 
for changing the conference session, and for making a PCS transition are 
examined on this network. 




Fig. 2. Test Network for Distributed Conferencing 



• Conference Initiation 

Because distributed conferencing is based on the dial-in conferencing model, 
the signaling procedure for invitation is the same as in the dial-in model. When 
host A invites host B, CS1, which is the ACS of A, takes charge of the 
conference signaling. Host A requests CS1 to create a session, and 
CS1 then becomes the PCS for additional signaling and conference control. 
Finally, hosts A and B are connected with CS1 according to the test network. 

• Inviting and Joining 

After a conference is created, if a participant wishes to invite a new user, the 
participant sends a REFER message to the relevant host. The invited host 
can then join the conference by sending an INVITE message to the CS. If an end host 
wishes to join a conference on its own, it can acquire the session information about 
the conference through SAP and access the CS using an INVITE message. In this 
case, the access point is always the PCS. The PCS connects the participant with an existing CS 
using the participant's INVITE message or, if necessary, handles all signaling 
procedures after the CSN has been composed by adding a CS. 

If the ACS information of a new participant is equal to the ACS information 
of the existing participants in a single-CS conference, it is better to change the CS so that 
streams flow efficiently. For example, when host A invites host C and 
hosts B and C have the same ACS information, a CS transition occurs. 
The CS transition procedure is identical to the PCS transition procedure. 

When a new participant joins a conference, the new CS can participate in 
the conference according to the CS policy. At this time, a new conference session 



A Scalable Distributed Architecture for Multi-party Conferencing Using SIP 



is formed between the conference servers. If the PCS decides to bring a 
new CS into the conference, the PCS sends an INVITE message to the new CS. The CSN initiates a 
conference session, and the conferencing model of the CSN can be full mesh, 
centralized, end system mixing, or a hybrid type according to the conference 
policy. Once the CSN is established and a new CS is added, the participants who have the 
new CS as their ACS reestablish their sessions. For this, the PCS asks the relevant 
participants to reconnect with the new CS by sending them a REFER message. Accordingly, 
the relevant participants send an INVITE message to the new CS and terminate their 
sessions with the old CS. 



[Signaling diagram; columns: H-A, H-D, CS1, H-B, H-C, H-E, CS2, H-F, H-G, CS3] 




Fig. 3 shows the signaling procedure for inviting host G when hosts A 
and D are connected with CS1, and hosts B, C, E, and F are connected with 
CS2. Because hosts F and G have CS3 as their ACS, a CS transition and a CSN 
re-composition take place. 

• Leaving 

When a participant wishes to leave a conference session while the conference 
is in progress, the participant sends a BYE message to the connected CS. A 
participant's leaving can lead to a CS leaving the CSN. When a participant leaves, 
if the number of participants belonging to a CS falls below the minimum 
number of CS participants set by policy, the CS leaves the CSN. To do so, the CS 
sends a REFER message to its remaining participants so that they 
reestablish their connections with another CS. When all of the remaining participants have left 
the CS, the CS sends a BYE message to the PCS and leaves the CSN. If the CS 
is the PCS, a PCS transition occurs: the PCS passes the role of PCS to one of 
the remaining CSs and notifies all participants of the PCS change by sending a 
NOTIFY message to everyone. 
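
The leaving procedure can be summarized as a small state model. This is a sketch under assumptions that the text leaves open: the CS that takes over the remaining participants (and the PCS role, if needed) is chosen here as the CS with the fewest participants, and messages are only recorded as tuples rather than sent. The class and method names are hypothetical.

```python
class ConferenceServerNetwork:
    """Toy model of the CSN state during the leaving procedure."""

    def __init__(self, members, pcs, min_participants):
        self.members = {cs: set(hosts) for cs, hosts in members.items()}
        self.pcs = pcs
        self.min_participants = min_participants
        self.messages = []  # recorded as (method, sender, receiver)

    def leave(self, host):
        cs = next(c for c, hosts in self.members.items() if host in hosts)
        self.messages.append(("BYE", host, cs))
        self.members[cs].remove(host)
        too_small = len(self.members[cs]) < self.min_participants
        if len(self.members) > 1 and too_small:
            self._retire(cs)

    def _retire(self, cs):
        remaining = [c for c in self.members if c != cs]
        # Assumption: hand over to the least loaded remaining CS.
        target = min(remaining, key=lambda c: (len(self.members[c]), c))
        for host in sorted(self.members.pop(cs)):
            self.messages.append(("REFER", cs, host))       # ask host to move
            self.messages.append(("INVITE", host, target))  # host reconnects
            self.members[target].add(host)
        if cs == self.pcs:
            self.pcs = target  # PCS transition
            for hosts in self.members.values():
                for host in sorted(hosts):
                    self.messages.append(("NOTIFY", self.pcs, host))
        else:
            self.messages.append(("BYE", cs, self.pcs))


net = ConferenceServerNetwork(
    {"CS1": {"A", "D"}, "CS2": {"B", "C", "E"}, "CS3": {"F", "G"}},
    pcs="CS1", min_participants=2)
net.leave("D")
```

In this example host D leaves CS1, which then falls below the minimum of two participants. CS1 refers host A to CS3, leaves the CSN, and because CS1 was the PCS, the role passes to CS3 and every participant is notified.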

4 Performance Analysis 

In this section, we evaluate the performance of the distributed multi-party 
conferencing model. In particular, we compare the performance of the existing 
centralized conferencing model with that of the proposed distributed model. 
We measured the signaling delay, the stream transmission delay, and the 
processing load of a conference server for the test network shown in 
Fig. 2. In the centralized model, CS1 plays the 
role of the centralized server, whereas in the distributed model, the management 
operations and the stream distribution are performed by CS1, CS2, and CS3 in a 
distributed way. The ACS of nodes A and D is CS1, the ACS of nodes 
B, C, and E is CS2, and the ACS of nodes F and G is CS3. 





(a) Average Signaling Delay for Invitation (b) Average Processing Load of Conference Server 



Fig. 4. Performance Analysis for the Distributed Conferencing Model 



Fig. 4 (a) shows the delay characteristics when participant A invites 
participants B, C, D, E, F, and G. The delay is measured as the average signaling 
completion time. Initially, participant A invites participant B, and in this 
case the average signaling delay is identical in the distributed and centralized 
models, as indicated by the delay for inviting participant B. However, 
when participant C is invited, the CSN is reconfigured. A new conference 
server, CS2, joins to handle the media requests from the new participant C, 
and the existing participant B has to be reassigned to CS2. This readjustment 
introduces some delay, so the invitation completion time for participant 
C is large, as shown in Fig. 4 (a). This pattern of reconfiguration continues 
as participants D, E, F, and G are invited, and the related signaling delays are 
shown. As Fig. 4 (a) shows, the distributed conferencing model incurs a larger 
delay than the centralized conferencing model. However, when the 
number of participants is large and they are grouped and located in different 
regions, the signaling delay can be reduced, since each regional CS can perform the 
conference management functions related to its own region. 

Fig. 4 (b) shows the measured processing load for encoding/decoding 
and mixing at the conference server for stream transmission. 
The processing load of the centralized conferencing model and that of the 
distributed conferencing model are shown for comparison. As shown in Fig. 4 (b), 
the processing load of the centralized model increases drastically as the number 
of participants increases, while in the distributed model the processing load is 
almost constant. This illustrates that the distributed model performs 
better than the centralized model with regard to scalability. 

In a large conference, because the number of hosts connected to 
each CS is smaller than in centralized conferencing, the processing load of each 
CS is lower than that of the centralized server. In particular, if the CSN is constructed 
as a tree topology instead of a full mesh topology in the test network, the load of the CSs 
may decrease further. 

5 Conclusion 

In this paper, we suggested a new distributed conferencing architecture that 
provides better scalability, which appears to be a very important feature in the 
wide-scale Internet community. We specifically designed signaling procedures and 
conferencing mechanisms for this architecture. The proposed architecture facilitates 
both integration with the existing models and transition from the existing 
models, and provides efficient load and traffic distribution, thereby achieving high 
scalability. For further study, we plan to apply the architecture in a real 
environment and to evaluate the performance enhancement by comparison with 
other works. 

Abstract. This paper proposes a DC-mesh network that allows requesting 
nodes to be put into clusters while their requests are sent to a target 
node, and that is easy to lay out on an LSI chip. To organize the DC-mesh, 
we use the partitioning of the word space based on the Hamming 
code [1]. We introduce an index scheme, (parity-value, information-value), 
in the word space, and map it onto a 4-D (dimensional) mesh so that the 
Hamming distance between the words in each partition is preserved in 
the Manhattan distance between the corresponding nodes on the mesh; 
two of the dimensions are contracted for easy wiring. The resultant DC-mesh 
consists of a number of local 2-D meshes and a single global 2-D 
mesh; all processing nodes linked to one local mesh are connected to one 
node of the global mesh via a bus to compensate for the contraction. 
A subset of the nodes in a partition organizes a dynamic cluster. The 
diameter equals the greater of the diameters of the local and global meshes. 



1 Introduction 

To reduce communication latency in multiprocessors, a cache hierarchy in a static 
cluster, whose member nodes are fixed in hardware, is a popular technique. 
However, the static cluster cannot adapt efficiently to changes in communication 
patterns, due to contention on per-cluster resources such as the directory 
for cache coherence, and because of the complexity of cache protocols. 

For hypercube-connected systems, a dynamic cluster can be organized 
whose member nodes are determined while the requests are sent to the target 
node [2], exploiting the partitioning of the n-bit word space based on the n-bit 
Hamming code [1]; one cluster consists of a subset of the nodes in a partition. 
No per-cluster resource is required for the clustering. The distance between the 
representative and any other node in each cluster is less than three. 

The hypercube, however, needs long wires in its layout, which will lead to an 
unacceptably long signal delay in a future LSI chip [3]. This paper addresses 
networks that need no long wires but can still produce dynamic clusters. As a 
network with such properties, we propose the DC-mesh (Dynamically Clustering 
mesh). To organize the mesh, we map the partitions in the word space [1] onto 
a high-dimensional mesh so that the Hamming distance in the word space is 
preserved in the Manhattan distance on the mesh as completely as possible. 
We start with a 2-D (dimensional) array of the words indexed by their parity 
and information values. With a reflected Gray code sequence, we map the array 
onto a 4-D mesh, but contract two of the dimensions for easy layout. This leads 
to multiple local 2-D meshes and a single global 2-D mesh that are disconnected 
from each other. To obtain the DC-mesh, we connect all processing nodes linked 
to a local mesh, together with one node of the global mesh, via a bus. 

Related work: Commercial and research systems adopt static clusters: 
STiNG [4] and SGI Origin [5] use a ring and an extended hypercube, 
respectively. The Stanford Dash is configured as a mesh [6]. A research chip, 
Hydra [7], has 4 processors and uses two buses between their caches. A chip 
reconfigurable for several types of applications includes 64 processing nodes that 
are connected with each other by a mesh [8]. 

The $\Psi$-cube [1] can produce dynamic clusters since it is organized through 
recursive partitioning based on the Hamming code. However, this network will be 
unacceptable in an LSI chip due to its long buses, though it is much easier to 
wire than the hypercube. It is possible to organize dynamic clusters in multistage 
networks if each network switch has a directory [9]. Dynamic clusters are also 
organized by freezing the memory blocks in a specified cache and allowing the 
other caches to access the frozen blocks with no cache coherence [10]. 

A few methods based on the Hamming code and/or other linear codes have 
been reported to map the resources such as I/O processors onto the hypercube 
so that each node is adjacent to at least one resource [11] or a specified number 
of resources [12,13]. However, those methods exploit none of the properties of 
Hamming codes exploited in our partitioning. 

In the rest of the paper, Section 2 describes the properties of partitions 
[1]. Section 3 organizes the DC-mesh and describes the routing method for 
clustering. Section 4 summarizes the paper and discusses future research. 

2 Properties of the Partitions 

This section describes the properties of Hamming code-based partitions that are 
used to organize the DC-mesh and dynamic clusters; for details, see [1]. 

A codeword $c$ of the $n$-bit Hamming code $\psi(n, k)$ has $p$-bit parity for $k$-bit 
information, where $p$ is the smallest integer that satisfies $(2^p - 1) \ge n$, and 
$k + p = n$. Assuming no error or a single-bit error in a received word $w$, the syndrome 
$e = w \cdot H^T$ indicates the erroneous bit position if $e \ne 0$, or no error otherwise, 
where $H^T$ is the transpose of the parity-check matrix $H$ for $\psi(n, k)$. 

For partitioning, we exploit not only single-bit errors, but also detectable 
double-bit errors, for which $e > n$. Then, of a pair of erroneous bit positions 
$(d, f)$ ($d \oplus f = e$, $d < f$), we fix position $f$ equal to $2^{p-1}$; so $d = e \oplus f$. The 
number $N_d$ of detectable double-bit errors is equal to $(2^p - 1 - n)$. 

The error vector $\hat{e}$ for syndrome $e$ has bit(s) of 1 at position(s) $s$ ($= e$) or 
$(d, f)$ ($d \oplus f = e$), and bits of 0 in the other positions. Let $T_c$ denote the partition 
represented by word $c$, and put a word $w$ with syndrome $e$ into partition $T_c$, 
where $c = w \oplus \hat{e}$ and $\oplus$ is the exOR operation. Then the $n$-bit word space is 
partitioned into $2^k$ partitions each of $2^p$ words. The Hamming distance between 
the leader (i.e., the representative) word $c$ and word $w$ is less than 3. 
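
The partitioning can be checked numerically with a short sketch. It assumes one particular parity-check matrix, namely the classic layout in which the column for bit position $i$ (counted from 1) is the binary representation of $i$, so the syndrome is the exOR of the positions of all 1-bits. The paper's matrix may be arranged differently; the function names are hypothetical.

```python
def parity_bits(n):
    """Smallest p with 2**p - 1 >= n."""
    p = 1
    while (1 << p) - 1 < n:
        p += 1
    return p


def syndrome(w, n):
    """exOR of the (1-based) positions of the 1-bits of word w."""
    e = 0
    for pos in range(1, n + 1):
        if (w >> (pos - 1)) & 1:
            e ^= pos
    return e


def error_vector(e, n):
    """Error vector for syndrome e: zero, single-bit, or double-bit."""
    if e == 0:
        return 0
    if e <= n:                         # single-bit error at position e
        return 1 << (e - 1)
    f = 1 << (parity_bits(n) - 1)      # fixed position f = 2**(p-1)
    d = e ^ f                          # the other erroneous position
    return (1 << (d - 1)) | (1 << (f - 1))


def leader(w, n):
    """Leader c = w xor error_vector of the partition containing w."""
    return w ^ error_vector(syndrome(w, n), n)


def partitions(n):
    parts = {}
    for w in range(1 << n):
        parts.setdefault(leader(w, n), []).append(w)
    return parts


parts = partitions(6)   # n = 6, hence p = 3 and k = 3
```

For $n = 6$ this yields $2^k = 8$ partitions of $2^p = 8$ words each, and every word differs from its leader in fewer than three bit positions, in line with the statement above.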

We produce multiple suits of codewords from the original suit $S_0$ of codewords 
for the parity-check matrix $H$. Then the suits organize another set of partitions 
for the word space, i.e., $2^p$ suits each of $2^k$ (code)words. Partitioned with any 
suit, the $2^p$ words in each obtained partition are those in different ($2^p$) suits. Thus 
every word belongs to $2^p$ partitions, each obtained with a separate suit, 
and is the leader when partitioned with the suit including the word. 

To avoid traffic congestion on a single leader when clustering the requests for 
a target node $t$ (see Section 3), we send the request from a requesting node $s$ to 
one ($\ell$) of its $2^p$ leaders (each in a separate suit). This routing is based on the 
following property: Let $S_t$ be the suit including a word $t$, and assume that a 
word $\ell$ is included in both suit $S_t$ and partition $T_s$. Then word $\ell$ is unique and 
is obtained by $\ell = s \oplus \hat{e}$, where $e$ is the syndrome for $s \oplus t$ with suit $S_0$. 
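
The leader selection rule can be exercised in code. This sketch rests on two assumptions that the text does not spell out: the suits are taken to be the cosets $S_0 \oplus a$ of the original code, so that two words are in the same suit exactly when their exOR has syndrome 0, and the parity-check matrix is the classic one whose column for position $i$ is the binary representation of $i$. All names are hypothetical.

```python
def parity_bits(n):
    p = 1
    while (1 << p) - 1 < n:
        p += 1
    return p


def syndrome(w, n):
    e = 0
    for pos in range(1, n + 1):
        if (w >> (pos - 1)) & 1:
            e ^= pos
    return e


def error_vector(e, n):
    if e == 0:
        return 0
    if e <= n:
        return 1 << (e - 1)
    f = 1 << (parity_bits(n) - 1)
    d = e ^ f
    return (1 << (d - 1)) | (1 << (f - 1))


def leader_for_target(s, t, n):
    """Leader of requester s in the suit of target t (suits = cosets)."""
    e = syndrome(s ^ t, n)            # syndrome of s xor t with suit S_0
    return s ^ error_vector(e, n)


n = 6
s, t = 0b101101, 0b010011
leader_word = leader_for_target(s, t, n)
same_suit = syndrome(leader_word ^ t, n) == 0
hops = bin(s ^ leader_word).count("1")
```

Under these assumptions `leader_word` lies in the same suit as `t` and is fewer than three bit flips away from `s`, and different targets generally select different leaders for the same requester.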

3 DC-meshes 

This section presents an indexed word space, the mapping from the space to the 
node-address space of the DC-mesh, and its structure and routing method. 

3.1 Indexed Word Space 

We assign the index $(i, j)$ to the word, denoted by $w_{i,j}$, whose parity and 
information parts have values $i$ and $j$, respectively; then an array of $2^p$ rows and 
$2^k$ columns is organized. We denote the $i$-th row and the $j$-th column by $P_i$ and $I_j$. 
Let the $\psi$-neighbors of a word $w_{i,j}$ be the non-leaders in the partition $T_{w_{i,j}}$. Recall 
that $N_d$ double-bit error words have incorrect bits in positions $(d, f)$. 
Let $d'$ denote a position $d$ in the parity part, and let $N_{d'}$ refer to the number of 
double-bit error words with erroneous positions $(d', f)$. Then, 

Theorem 1. Of the $\psi$-neighbors of word $w_{i,j}$, $k$ words are in row $P_i$, $(p + N_{d'})$ 
words are in column $I_j$, and $(N_d - N_{d'})$ words are at the cross-points of row 
$P_{i \oplus e_f}$ and columns $I_{j \oplus e_d}$ ($d \in$ information part), i.e., words $w_{i \oplus e_f,\, j \oplus e_d}$. 

Proof. Of the single-bit error words of word $w_{i,j}$, $k$ words each have an error in 
an information bit, so those words are in row $P_i$. Likewise, $p$ words, each with an 
error at a parity position, are located in column $I_j$. Moreover, of the double-bit 
error words, $N_{d'}$ words, each with errors at positions $(d', f)$, are in column $I_j$, so 
a total of $(p + N_{d'})$ words are in $I_j$. Since $(N_d - N_{d'})$ double-bit 
error words each have a pair $(d, f)$ ($d \in$ information part) of error positions, 
those words are located at the cross-points of row $P_{i \oplus e_f}$ and columns $I_{j \oplus e_d}$. ■ 

3.2 Structure of the DC-meshes 

The parity size $p$ does not vary much (it equals 3 or 4) for $n$ up to 15, 
so we map each column $I_j$ onto one local 2-D mesh, i.e., a basic block of the 
DC-mesh. Let node $(i, j)$ be the node on which word $w_{i,j}$ is mapped. In parallel 
with this mapping, we map each row $P_i$ ($0 \le i < 2^p$) onto a 2-D mesh, so that it 
consists of the nodes $(i, j)$ ($j = 0, \ldots, 2^k - 1$), each in a different local mesh for 
columns $I_j$. Then a 4-D mesh is obtained. 

Since a 4-D mesh is generally difficult to lay out, we contract the $2^p$ 
2-D meshes for the rows into a separate 2-D mesh, called the global mesh, that is 
produced by contracting all $2^p$ nodes $(i, j)$ ($i = 0, \ldots, 2^p - 1$) in the local mesh 
for each column $I_j$ into a single node, denoted by $(*, j)$, of the global mesh. Then 
we obtain $2^k$ local meshes for the columns and one global mesh for the rows; 
note that these meshes have no connection with each other. 

The DC-mesh is organized as follows: We connect one processing node to a
node of a local mesh by a direct link, and connect all processing nodes attached to the
local mesh for $I_j$ together to node $*j$ of the global mesh by a bus. Moreover, to
preserve the $\psi$-neighbor relation of the word space on the DC-mesh as completely
as possible, we exploit the Gray code as the mapping function (see Definition
1) in the word-to-node mapping described above, since this code allows the
adjacency relation in the word space to be preserved on the mesh [14].

Let $i(r_1)$ and $i(c_1)$ be the upper $r_1$ bits and the lower $c_1$ ($r_1 + c_1 = p$) bits
of index $i$. Likewise, the upper $r_2$, middle $c_2$, and lower $d_2$ ($r_2 + c_2 + d_2 = k$)
bits of index $j$ are denoted by $j(r_2)$, $j(c_2)$, and $j(d_2)$, where $d_2$ equals $(n - 8)$
if $n > 8$ or 0 otherwise, and keeps the size of the global mesh small, i.e., less than
or equal to $4 \times 4$. Let $*(j/2^{d_2})$ denote the set of (words in the) $2^{d_2}$ columns of
which indices $j$ are the same in the $r_2$-bit and $c_2$-bit portions, but are different
from each other in the $d_2$-bit portion. Indices $(x_g, y_g)$ and $(x_\ell, y_\ell)$ are used for
the nodes in the global mesh and the local mesh respectively. We
denote the $m$-th code in the sequence of $len$-bit Gray codes by $G(m, len)$.

Definition 1. (Mapping Method) We map word $W_{i,j}$ in column $I_j$ on the node
$(x_\ell, y_\ell)$ of the local $2^{r_1} \times 2^{c_1}$ mesh $M^{\ell}_{x_g,y_g,z_g}$ and map the set $*(j/2^{d_2})$ of $2^{d_2}$
columns on the node $(x_g, y_g)$ of the global $2^{r_2} \times 2^{c_2}$ mesh $M^g$, where
$x_\ell = G^{-1}(i(r_1), r_1)$, $y_\ell = G^{-1}(i(c_1), c_1)$, $x_g = G^{-1}(j(r_2), r_2)$, $y_g = G^{-1}(j(c_2), c_2)$,
$z_g = j(d_2)$; and $G^{-1}$ is the inverse of $G$.

The mapping when $n = 6$ (and hence, $k = p = 3$) is shown in Fig. 1. A
mesh node is shown by the rectangle, outside of which the node index is shown
(only for $M^{\ell}_{0,0}$ and $M^g$, for space). The two-digit integer $ij$ for $M^{\ell}$ or $*j$ for $M^g$
inside the node represents the index, $(i,j)$ or $*(j/2^{d_2})$ ($2^{d_2} = 1$, in this case), of
the word or column-set mapped on the node. Each column $I_j$ is mapped onto
a $2 \times 4$ $M^{\ell}$ ($r_1 = 1$ and $c_1 = 2$). Word $W_{5,0}$ ($i = 5$) in column $I_0$, for instance,
is mapped on node $(G^{-1}(5(r_1), r_1), G^{-1}(5(c_1), c_1)) = (1,1)$ of $M^{\ell}_{0,0}$. The $2 \times 4$
$M^g$ ($r_2 = 1$ and $c_2 = 2$) for rows is obtained by mapping the set of a single
column (since $d_2 = 0$) onto node $(G^{-1}(j(r_2), r_2), G^{-1}(j(c_2), c_2))$. For example,
column $*3$ is mapped on node $(0, 2)$ of $M^g$.
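
The mapping of Definition 1 can be sketched in a few lines of Python. This is a minimal illustration, assuming that $G$ is the standard reflected binary Gray code (the two worked examples above are consistent with that choice); the function names `gray`, `gray_inverse`, `split_bits`, and `map_word` are hypothetical and do not come from the paper.

```python
def gray(m: int) -> int:
    """Return the m-th code of the reflected binary Gray sequence (assumed form of G)."""
    return m ^ (m >> 1)


def gray_inverse(code: int) -> int:
    """Return the position m in the Gray sequence such that gray(m) == code."""
    m = 0
    while code:
        m ^= code
        code >>= 1
    return m


def split_bits(value: int, widths: list[int]) -> list[int]:
    """Split value into bit fields of the given widths, most significant field first."""
    fields = []
    shift = sum(widths)
    for w in widths:
        shift -= w
        fields.append((value >> shift) & ((1 << w) - 1))
    return fields


def map_word(i: int, j: int, r1: int, c1: int, r2: int, c2: int, d2: int):
    """Map word (i, j) to ((x_g, y_g, z_g), (x_l, y_l)) following Definition 1."""
    i_r, i_c = split_bits(i, [r1, c1])
    j_r, j_c, j_d = split_bits(j, [r2, c2, d2])
    local_node = (gray_inverse(i_r), gray_inverse(i_c))
    global_node = (gray_inverse(j_r), gray_inverse(j_c), j_d)
    return global_node, local_node


# n = 6: r1 = 1, c1 = 2, r2 = 1, c2 = 2, d2 = 0 (the setting of Fig. 1)
print(map_word(5, 3, 1, 2, 1, 2, 0))
```

For $n = 6$ the call above places a word with $i = 5$ on local node $(1,1)$ and column $*3$ on global node $(0,2)$, matching the two examples given in the text.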

The mapping when $n = 10$ (so $p = 4$ and $k = 6$) is shown in Fig. 2. In this
case, $r_1 = c_1 = 2$ and $r_2 = c_2 = d_2 = 2$, so both $M^{\ell}$ and $M^g$ are of $4 \times 4$; one $M^{\ell}$
is shown by the rectangle. Indices $x_g$ and $y_g$ for $M^{\ell}$ meshes (and hence, of the $M^g$
nodes) are shown in the left-most and upper-most parts. Since $2^{d_2} = 4$, the set
$*(j/4)$ of four columns, denoted by $*(j/4)z_g$ ($z_g = 0, \ldots, 3$) in the box ($j/4$ is
expressed by a hexadecimal number), are mapped on the node $*(j/4)$ with index
$(G^{-1}(j(r_2), r_2), G^{-1}(j(c_2), c_2))$ of the $M^g$; this leads to the quintuplet of four
$M^{\ell}$s with indices $(G^{-1}(j(r_2), r_2), G^{-1}(j(c_2), c_2), z_g)$ ($z_g = 0, \ldots, 3$) and the
$M^g$ node. Index $z_g$ (shown in the parentheses near the two $M^{\ell}$s in two quintuplets,
for space) of $M^{\ell}$ equals 0, 1, 2, and 3 respectively for the upper-left, upper-right,
lower-left, and lower-right meshes in each quintuplet. To increase the bandwidth
in $M^g$, adjacent nodes of the $M^g$ are connected by four ($= 2^{d_2}$) links.







Fig. 1. The 64-node DC-mesh. 




Fig. 2. The 1k-node DC-mesh.

3.3 Routing and Clustering in the DC-mesh

Assume the source and target addresses, $((x_g(s), y_g(s), z_g(s)), (x_\ell(s), y_\ell(s)))$ and
$((x_g(t), y_g(t), z_g(t)), (x_\ell(t), y_\ell(t)))$, of a message in the DC-mesh. Then the message
is sent according to the following XY-routing:

Definition 2. (Routing Method) The message is sent to the target in the local
mesh $M^{\ell}_{x_g(s),y_g(s),z_g(s)}$ if the global indices of the source and target are the
same: $(x_g(s), y_g(s), z_g(s)) = (x_g(t), y_g(t), z_g(t))$. Otherwise, it is first sent to node
$(x_g(s), y_g(s))$ of the global mesh via the local bus $z_g(s)$ of the source, next to
node $(x_g(t), y_g(t))$ on the $M^g$ if $x_g(s) \ne x_g(t)$ or $y_g(s) \ne y_g(t)$ (this step is not
required if $x_g(s) = x_g(t)$ and $y_g(s) = y_g(t)$), and last to the target $(x_\ell(t), y_\ell(t))$ in the
local mesh $M^{\ell}_{x_g(t),y_g(t),z_g(t)}$ via the target's local bus $z_g(t)$.

Theorem 2. The Manhattan distance required for a message transfer equals
$|x_\ell(s) - x_\ell(t)| + |y_\ell(s) - y_\ell(t)|$ if $(x_g(s), y_g(s), z_g(s)) = (x_g(t), y_g(t), z_g(t))$,
$|x_g(s) - x_g(t)| + |y_g(s) - y_g(t)|$ if $x_g(s) \ne x_g(t)$ or $y_g(s) \ne y_g(t)$, or 0 otherwise.

Proof. This is clear since the message is sent via the local mesh in the first case, 
through the global network in the second case, and via the local buses otherwise, 
assuming that the connection, such as a point-to-point link and a bus, between 
a network node and a processing node is not counted in the distance. 
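
The three cases of Theorem 2 can be written as a short function. The sketch below is only an illustration under the theorem's own assumption that bus and processing-node links are not counted; the address layout `((x_g, y_g, z_g), (x_l, y_l))` and the name `dc_mesh_distance` are hypothetical choices made here.

```python
def dc_mesh_distance(src, dst) -> int:
    """Manhattan distance of a transfer, per the three cases of Theorem 2.

    Each address is ((x_g, y_g, z_g), (x_l, y_l)).
    """
    (src_global, src_local), (dst_global, dst_local) = src, dst

    # Case 1: same local mesh, so the message stays on that mesh.
    if src_global == dst_global:
        return abs(src_local[0] - dst_local[0]) + abs(src_local[1] - dst_local[1])

    # Case 2: different global-mesh nodes, so only the global hops are counted.
    if src_global[:2] != dst_global[:2]:
        return abs(src_global[0] - dst_global[0]) + abs(src_global[1] - dst_global[1])

    # Case 3: same global node but different z_g, reached through the buses only.
    return 0


print(dc_mesh_distance(((0, 2, 0), (1, 1)), ((0, 2, 0), (0, 3))))  # local case
print(dc_mesh_distance(((0, 0, 1), (0, 0)), ((3, 2, 0), (1, 1))))  # global case
print(dc_mesh_distance(((1, 1, 0), (0, 0)), ((1, 1, 3), (2, 2))))  # bus-only case
```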

Thus the diameter of the DC-mesh is equal to the greater of the diameters of the
local and global meshes. For the clustering, we use the value of word $W_{i,j}$ mapped
on node $(i,j)$ as its address. The word value is easy to obtain by $i = G(x_\ell, r_1) \circ
G(y_\ell, c_1)$, and $j = G(x_g, r_2) \circ G(y_g, c_2) \circ z_g$, where $\circ$ is concatenation. A dynamic
cluster is produced from a subset of the nodes in a partition. Let $C_x$ be a dynamic
cluster produced from the partition $T_x$ represented by node $x$; note that node $x$
may not be in $C_x$, but we say that the cluster is represented by node $x$. Recall that
$S_{\ni t}$ denotes the suit of words that includes word $t$. Then,

Theorem 3. (Dynamic Clustering) If a node $s$ requesting a service from a
target node $t$ sends the request to node $\ell = s \oplus \varepsilon_e$ included in both partition $T_s$
and suit $S_{\ni t}$, then node $s$ is put into the cluster $C_\ell$, where $e$ is the syndrome for
$s \oplus t$ with suit $S_0$. Moreover, node $s$ is put into different clusters, $C_{\ell_1}$ and $C_{\ell_2}$,
for separate targets, $t_1$ and $t_2$, if they are in different suits, $S_{\ni t_1}$ and $S_{\ni t_2}$.

Proof. There is a unique node $\ell$ included in both partition $T_s$ and suit $S_{\ni t}$ (see
Section 2). Since all requesting nodes send their requests to one of the nodes in
suit $S_{\ni t}$, those nodes are partitioned into clusters $C_\ell$. Particularly, node $s$ is put
into cluster $C_\ell$. For different target nodes, $t_1$ and $t_2$, such that $S_{\ni t_1} \ne S_{\ni t_2}$, node
$s$ belongs to separate clusters, $C_{\ell_1}$ and $C_{\ell_2}$, because leader nodes, $\ell_1$ ($= s \oplus \varepsilon_{e_1}$)
and $\ell_2$ ($= s \oplus \varepsilon_{e_2}$), are in different suits, $S_{\ni t_1}$ and $S_{\ni t_2}$, where $e_1$ and $e_2$ are the
syndromes for $s \oplus t_1$ and $s \oplus t_2$, both with suit $S_0$.

The Hamming distance between node $s$ and leader $\ell$ is less than 3 since
$s \in C_\ell \subset T_\ell$ (or $\ell \in T_s$). The type of a request sent from leader $\ell$ to target $t$ and
its issue timing depend on the application. For instance, leader $\ell$ relays the first
received request to the target for cache coherence, or produces a request after
receiving all requests and sends it to the target for barrier synchronization.

Last, we describe how the $\psi$-neighbor relation is preserved in the partitions
(and clusters) produced on the DC-mesh. We denote the Hamming and Manhattan
distances between indexes $i$ and $i'$ by $HD(i,i')$ and $MD(i,i')$.

Theorem 4. (Preservation of the $\psi$-Neighbor Relation) The $\psi$-neighbor relation
in the partitions of the word space is almost preserved in the partitions on the
DC-mesh. Strictly, the Hamming distance of 1 in the word space is preserved in
the Manhattan distance on the mesh, while the Hamming distance of 2 is mapped
to the Manhattan distance greater than 0.

Proof. Let us consider the $\psi$-neighbors of node $(i,j)$. Then $(p + N_{d'})$
of the $\psi$-neighbors are in the $M^{\ell}$ for column $I_j$ (Theorem 1). Each of the $p$
nodes has an index $(i',j)$ (for a single-bit error in the parity part), and hence,
$HD(i,i')$ between the nodes $(i,j)$ and $(i',j)$ equals one. So $MD(i,i') = 1$ for
those nodes, owing to the mapping function $G$. Likewise, each of the $N_{d'}$
$\psi$-neighbors has an index $(i'',j)$ (for an error in double bits both on parity
positions), so that $HD(i,i'') = 2$. Generally, $MD(i,i'') \ge 2$ even with the mapping
function $G$ (though $MD(i,i'') = 2$ for the mesh of which size is less than
or equal to $4 \times 4$ such as shown in Figs. 1 and 2). Since each of the $k$
$\psi$-neighbors has an index $(i,j')$ (for a single-bit error in the information part),
it is in the $M^{\ell}$ for column $I_{j'}$. So the distance $HD(j,j')$ of 1 is preserved in
the distance $MD(j,j')$ because the message is then sent from node $*j$ to node
$*j'$ on the $M^g$. Each of the $(N_d - N_{d'})$ nodes has an index $(i',j')$
(for an error in double bits, one of which is in the information part). Then
the HD between nodes $(i,j)$ and $(i',j')$ equals $HD(i,i') + HD(j,j') = 2$, but
$MD(i,i') + MD(j,j') = 1$ since $MD(j,j') = 1$ and $MD(i,i') = 0$; the latter is
the distance from node $*j'$ of the $M^g$ to node $(i',j')$ via the bus.

4 Conclusions 

We have proposed the DC-mesh for the dynamic clustering of nodes, as well as 
for easy layout on an LSI chip, exploiting the properties of partitioning based on 
the Hamming code. We first arranged the word space into an array so that the 
word with the parity value i and the information value j has index (i,j). Next, 
we mapped the indexed word space onto the node space of a 4-D (dimensional) 
mesh, according to the inverse of Gray code. Last, we contracted two dimensions 
of the obtained mesh for an easy layout. 

The resultant DC-mesh consists of multiple local 2-D meshes and a single
global 2-D mesh; each local mesh is connected to a node of the global mesh by a
bus to compensate for the contracted dimensions. The diameter of the DC-mesh
is equal to the maximum of the diameters of local and global meshes.

A dynamic cluster for a service in a target node is organized in a partition
if its member node sends a request to its leader node, which is determined by
the addresses of the member and target nodes. Since the effective number of
dimensions of the DC-mesh is greater than 2, the Hamming distance, which is less
than 3, between the leader and non-leader words in each partition of the word
space is almost preserved in the Manhattan distance between the corresponding
nodes on the DC-mesh.

To increase the bandwidth between the local and global meshes, we can 
exploit multiple buses and hence, multiple global meshes, leading to a fat DC- 
mesh. Another approach to DC networks is to connect the nodes in each partition 
with each other by a single bus, leading to a bused fat-hypercube (fat due to the 
connections corresponding to double-bit errors). In any case, we need to evaluate 
the performance of these DC networks by simulation with real applications.

Abstract. The OTIS-hypercube is an optoelectronic architecture for
interconnecting the processing nodes of a multiprocessor system. In this paper, an
empirical performance evaluation of the OTIS-hypercube is conducted for 
different traffic patterns and routing algorithms. It is shown that, depending on 
the traffic pattern, minimal path routing may not have the best performance and 
that adaptivity may be of no improvement. All judgments made are based on 
observations from the results of extensive simulation experiments of the 
interconnection network. In addition, logical explanations are suggested for the 
cause of certain noticeable performance characteristics. 



1. Introduction 

In order to exploit the speed and power advantages of optical interconnect (in
communication distances exceeding a few millimeters [1, 2]), the OTIS architecture for
interconnection networks has been suggested by Marsden et al. [3], Hendrick et al. [4] and
Zane et al. [5]. Algorithmic properties of specific cases, such as the OTIS-hypercube
and OTIS-mesh, have also been developed in the literature [7-13]. However, previous
studies have, to the best of our knowledge, only considered topological and algorithmic
issues in OTIS computers, and no study has evaluated the performance of these
systems in light of parameters such as bandwidth and message latency, in view of
realistic implementation assumptions.

The main purpose of this work is to take a step in this direction by initially
developing a deadlock-free routing scheme for the OTIS-hypercube, and evaluating
the performance of the network under realistic conditions and structural constraints.
To this end, extensive simulation experiments have been conducted on the network,
with different routing algorithms, traffic patterns, traffic loads, network sizes,
message lengths and number of virtual channels.



2. The OTIS-Hypercube and Its Router Structure 

In the OTIS-hypercube parallel computer, there are $2^{2n}$ processors organized as $2^n$
groups of $2^n$ nodes each. The processors in each group form an $n$-dimensional
hypercube that employs electrical interconnect. The inter-group interconnections are
realized by optics. In the OTIS interconnect system, processor $(i,j)$, i.e. processor $j$ of
group $i$, is connected via optics to processor $(j, i)$.

A node, in the $n$-dimensional OTIS-hypercube, or OTIS-$H_n$ for short, consists of
a processing element (PE) and a switching element (SE). The PE contains a processor 
and some local memory. A node is connected, through its SE, to its intra-group 
neighboring nodes using n input and n output electronic channels. Two electronic 
channels are used by the PE to inject/eject messages to/from the network. Messages 
generated by the PE are transferred to the router through the injection channel. At the 
destination node, messages are transferred to the local PE through the ejection 
channel. The optical channel is used to connect a node to its transpose node in some 
other group for inter-group communication. The router contains flit buffers for each 
incoming channel. A number of flit buffers are associated with each physical input 
channel. The flit buffers associated with each channel may be organized into several 
lanes (or virtual channels), and the buffers in each virtual channel can be allocated 
independently of the buffers in any other virtual channel [6]. The concept of virtual
channels was first introduced in the context of the design of deadlock-free
routing algorithms, where the physical bandwidth of each channel is multiplexed
between a number of messages [6]. However, virtual channels can also reduce
network contention. The input and output virtual channels are connected by a crossbar 
switch that can simultaneously connect multiple input channels to multiple output 
channels given that there is no contention over the output channels. 



3. Message Routing in the OTIS-Hypercube 

The routing scheme used for inter-group routing and the routing algorithm used for 
intra-group routing collectively determine the exact routing algorithm in an OTIS- 
hypercube network. In what follows, we refer to different routing schemes in order to 
identify only the manner in which a message travels between different sub-graphs 
(groups) of the network to reach its destination. Two basic routing schemes can be 
suggested for any source-destination pair of nodes in an OTIS-network. In the first 
scheme, a message is routed in the local sub-graph in which it starts until it reaches 
the node that has the same node address as the destination node. From that node, the 
optical channel is taken into another sub-graph. In this sub-graph, the message is 
routed until it reaches a node that has the same node address as the sub-graph address 
of the destination node. Once there, the message takes its final optical hop to the 
destination node. In the second basic scheme, a message is first routed to a node that 
has a node address equal to the sub-graph address of the destination. Once there, the 
optical channel takes the message to the sub-graph of the destination node. The 
message is then routed to the destination node within this sub-graph. 

Which of these two routing schemes takes the shorter path depends on
the full addresses of the source and destination nodes. For the OTIS-hypercube,
this can be determined easily. If the number of differing bits between the full
addresses of the source and destination nodes is less than that between the source node and the
transpose of the address of the destination node, the first routing scheme will result in a
shorter path. Otherwise, the second scheme will. Note, however, that
in the first routing scheme, once the first optical channel has been taken, the
remainder of routing can be conducted by the second scheme. Therefore, if in each
intermediate node, a message is routed according to the basic scheme that takes a
shorter path to the destination of the message (without considering the source node), a
minimal-path routing scheme, the third scheme, is obtained.
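
The comparison described above reduces to two Hamming-distance sums. The following sketch is an illustration only: the address layout `(group, node)` and the function names are assumptions made here, and the special case of source and destination lying in the same group is not treated separately.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two addresses."""
    return bin(a ^ b).count("1")


def first_scheme_hops(src, dst):
    """(electronic hops, optical hops) when two optical channels are taken."""
    (gs, ps), (gd, pd) = src, dst
    # Route to node pd in the source group, cross, route to node gd, cross again.
    return hamming(ps, pd) + hamming(gs, gd), 2


def second_scheme_hops(src, dst):
    """(electronic hops, optical hops) when one optical channel is taken."""
    (gs, ps), (gd, pd) = src, dst
    # Route to node gd in the source group, cross, then route from gs to pd.
    return hamming(ps, gd) + hamming(gs, pd), 1


def shorter_scheme(src, dst) -> str:
    """Apply the stated rule: compare differing bits against dst and its transpose."""
    first_bits, _ = first_scheme_hops(src, dst)
    second_bits, _ = second_scheme_hops(src, dst)
    return "first" if first_bits < second_bits else "second"


print(shorter_scheme((0b010, 0b001), (0b011, 0b001)))
print(shorter_scheme((0b000, 0b000), (0b111, 0b000)))
```

Applying `shorter_scheme` again at every intermediate node, with that node as the new source, corresponds to the minimal-path (third) scheme described in the text.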

Routing within a hypercube may be deterministic, partially adaptive or fully 
adaptive. Any of these routing techniques may be used for intra-group routing in the 
OTIS-hypercube network. In order for a routing algorithm to be deadlock-free, cyclic 
buffer dependencies between messages and the virtual channels they allocate, must 
not occur. In the hypercube network, dimension order routing, a well-known 
deterministic routing algorithm, is inherently deadlock-free. Partially adaptive routing 
algorithms based on the turn model [14], such as p-cube routing, are also deadlock 
free. For fully adaptive routing to be deadlock free, virtual channel utilization must be 
restricted in a way, such as that suggested in [15]. But in an OTIS-hypercube, cyclic 
buffer dependencies between channels may also occur through the optical connections 
between groups. 

To prevent the occurrence of such cyclic buffer dependencies, messages that
enter a group through an optical channel must traverse that group through a separate
set of virtual channels from those of messages originating in that group. Therefore, we
suggest that the virtual channels of each electronic channel be split into two equal
sets, i.e. each group be split into two virtual groups, $v_1$ and $v_2$. After being injected
into the network, a message traverses the source group through $v_1$. But once an optical
channel has been taken and the message has entered another group, that group is
traversed through $v_2$.

When messages traverse only one optical channel in their path (the second 
routing scheme), no restriction is necessary on the utilization of the virtual channels 
of optical channels. But when messages traverse two optical channels (the first 
routing scheme), cyclic dependencies may still occur if all the virtual channels of 
optical channels are allowed to be utilized by messages taking their first optical hop. 
Thus, for the first routing scheme (and consequently the minimal scheme), one of the 
virtual channels of all optical channels must be reserved for messages that are 
traversing a second optical channel (entering their destination node). All the other 
virtual channels of optical channels can be allowed to be traversed with no restriction. 
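
The virtual-channel restrictions described in the last two paragraphs can be summarized as a pair of selection functions. This is a minimal sketch, not the authors' router logic: which indices form $v_1$ and $v_2$, and which optical virtual channel is the reserved one, are arbitrary assumptions made here for illustration.

```python
def electronic_vcs(num_vcs: int, optical_hops_taken: int) -> list[int]:
    """Virtual channels a message may use on an electronic channel.

    Assumption: v1 is the lower half of the indices and v2 the upper half.
    """
    half = num_vcs // 2
    if optical_hops_taken == 0:
        return list(range(half))           # still in the source group: v1
    return list(range(half, num_vcs))      # entered a group optically: v2


def optical_vcs(num_vcs: int, optical_hops_taken: int) -> list[int]:
    """Virtual channels a message may use on an optical channel.

    Assumption: the highest index is the channel reserved for messages
    taking their second optical hop (first and minimal schemes).
    """
    reserved = num_vcs - 1
    if optical_hops_taken == 0:
        return list(range(reserved))       # first optical hop: all but the reserved one
    return list(range(num_vcs))            # second optical hop: any, including reserved


print(electronic_vcs(4, 0), electronic_vcs(4, 1))
print(optical_vcs(4, 0), optical_vcs(4, 1))
```

When only the second routing scheme is used, messages take a single optical hop, so the reservation in `optical_vcs` is unnecessary and all optical virtual channels could be offered to every message.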

Since a message that has traversed its second optical channel has definitely
entered its destination node, it cannot be part of a cyclic buffer dependency. It is for
this reason that reserving one of the virtual channels of each optical channel,
specifically for such messages, eliminates the possibility of the occurrence of cyclic
buffer dependencies through the optical channels. In this manner, the deadlock-free
nature of the specific hypercube routing algorithm, used for intra-group routing,
is preserved in the OTIS-hypercube.



4. Empirical Performance Evaluation

The traffic patterns considered in our evaluation are

Uniform: destination node can be any network node with an equal probability.
Complement: node $(a_{n-1} \ldots a_1 a_0)$ sends message to node $(\bar{a}_{n-1} \ldots \bar{a}_1 \bar{a}_0)$.
Bit-reverse: node $(a_{n-1} a_{n-2} \ldots a_1 a_0)$ sends message to node $(a_0 a_1 \ldots a_{n-2} a_{n-1})$.
Bit-flip: node $(a_{n-1} \ldots a_1 a_0)$ sends message to the node whose address is the bit-flip of its own.
Butterfly: node $(a_{n-1} a_{n-2} \ldots a_1 a_0)$ sends message to node $(a_0 a_{n-2} \ldots a_1 a_{n-1})$.
Perfect-shuffle: node $(a_{n-1} a_{n-2} \ldots a_1 a_0)$ sends message to node $(a_{n-2} a_{n-3} \ldots a_0 a_{n-1})$.
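
As an illustration, four of the permutation patterns can be expressed as bit operations on an $n$-bit address. The sketch assumes the commonly used definitions (bitwise complement, reversal of the bit order, exchange of the most and least significant bits, and left rotation by one position); the function names are hypothetical, and bit-flip is left out because its exact definition is not recoverable from the text.

```python
def complement(a: int, n: int) -> int:
    """Invert every bit of the n-bit address."""
    return ~a & ((1 << n) - 1)


def bit_reverse(a: int, n: int) -> int:
    """Reverse the order of the n address bits."""
    return int(format(a, f"0{n}b")[::-1], 2)


def butterfly(a: int, n: int) -> int:
    """Swap the most significant and least significant bits."""
    msb = (a >> (n - 1)) & 1
    lsb = a & 1
    if msb != lsb:
        a ^= (1 << (n - 1)) | 1  # flipping both bits swaps them
    return a


def perfect_shuffle(a: int, n: int) -> int:
    """Rotate the n-bit address left by one position."""
    return ((a << 1) | (a >> (n - 1))) & ((1 << n) - 1)


n = 4
for name, f in [("complement", complement), ("bit-reverse", bit_reverse),
                ("butterfly", butterfly), ("perfect-shuffle", perfect_shuffle)]:
    print(name, format(f(0b1001, n), f"0{n}b"))
```

Under these definitions a butterfly destination differs from its source in at most two bits, while a complement destination differs in all $n$ bits.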

To evaluate the functionality of the OTIS-hypercube network under different
conditions, a discrete-event simulator has been developed that mimics the behavior of
the described routing algorithms at the flit level. In each simulation experiment, a
minimum of 120,000 messages were delivered and the average message latency
calculated. Statistics gathering was inhibited for the first 10,000 messages to avoid
distortions due to startup transience. The average message latency is defined as the
average amount of time from the generation of a message until the last data flit of that
message is consumed at the local PE at the destination node. The network cycle time
is defined as the transmission time of a single flit from one router to the next, through
an electronic channel. The transmission time of a flit through an optical channel is,
however, a fraction of the network cycle time. Messages are generated at each node
according to a Poisson process with a given mean generation rate in messages per
cycle. All messages have a fixed length of M flits. The destination node of each
message has been determined through a uniform random number generator to
simulate a uniform traffic pattern.

Numerous simulation experiments have been performed for different scenarios of
the traffic load, traffic pattern and routing algorithm for various message lengths and
network configurations. However, message length and network configuration have
been observed to have no effect on the relative performance of different
scenarios. Hence, for brevity, we report the results for only a typical setting. This
setting consists of a six-dimensional OTIS-hypercube with four virtual channels per
physical channel. The ratio of optical channel transmission time to electronic channel
transmission time is equal to 1/10 and messages have a fixed length of 32 flits.

In the following subsections, we measure the performance of different traffic
patterns by means of the saturation point. The saturation point is the maximum
injection rate at which the average delay is still bounded. It is assumed that when the
average message latency is higher than 200,000 cycles, the network enters the
saturation region.
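
The saturation criterion above can be applied mechanically to a sweep of measured latencies. The sketch below assumes a list of `(generation_rate, average_latency)` pairs from separate simulation runs; the function name and the sample numbers are hypothetical and are not results from this study.

```python
SATURATION_LATENCY = 200_000  # cycles, the threshold stated in the text


def saturation_point(samples):
    """Return the highest generation rate reached before latency exceeds the threshold.

    samples: iterable of (generation_rate, average_latency) pairs.
    Returns None if even the lowest rate is already saturated.
    """
    last_unsaturated = None
    for rate, latency in sorted(samples):
        if latency > SATURATION_LATENCY:
            break  # every higher rate is treated as saturated
        last_unsaturated = rate
    return last_unsaturated


# Hypothetical sweep, for illustration only.
sweep = [(0.001, 50.0), (0.002, 120.0), (0.003, 250_000.0), (0.004, 900_000.0)]
print(saturation_point(sweep))
```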



4.1. Uniform, Bit-Flip, Bit-Reverse, and Butterfly Traffic Patterns

In an OTIS-hypercube with uniform traffic, regardless of the routing scheme, the
performance of adaptive routing is superior to that of deterministic and p-cube
routing, as can be seen in Figure 1. Furthermore, the minimal routing scheme
performs better than the first and second routing schemes. But the interesting point is
that deterministic routing saturates at a higher generation rate than that of p-cube
routing. The OTIS-hypercube inherits this performance characteristic for uniform
traffic from the hypercube network (Glass and Ni have reported such a characteristic
for the performance of the hypercube network [14]). Due to the fact that, when used
individually, the first and second schemes do not always route messages through an
optimal path, one would expect the minimal routing scheme to saturate at much
higher generation rates. But for adaptive routing, the difference between the
performance of minimal routing and that of the first or second routing scheme is less
than what may have been predicted. Thus, considering the extra complexity of
implementing the minimal routing scheme, this scheme may not be an efficient option
for such a system in which traffic is uniform. But, as will be shown in the following
sections, for other traffic patterns, minimal routing may even result in performance
poorer than that of the first or second schemes.



[Figure 1 panel title: 6-D OTIS-hypercube, L = 32 (uniform traffic)]







Fig. 1: Average message latency of uniform, bit flip, bit reverse, and butterfly traffic patterns 

As shown in Figure 1, with bit-flip traffic the network saturates at a much
higher generation rate when the second scheme is used than when the minimal routing
scheme is used, for all three intra-group routing algorithms. This holds
even though messages travel a longer average distance with the second scheme. An
explanation for this is that, with the first routing scheme, all messages generated by
bit-flip traffic in a specific group exit that group through the same optical channel
(the optical channel exiting a node whose address is the bit-flip of the group address),
creating a bottleneck in the network. It is also apparent from the results that, with the
second routing scheme, the generation rate for which the network saturates is greater
for p-cube routing than that for deterministic routing. The reason for this is that with
bit-flip traffic in an OTIS-hypercube, when the second routing scheme is used, the
traffic within each sub-graph is also bit-flip traffic. As shown in [14], a hypercube
network with p-cube routing saturates at a higher generation rate than that of
deterministic routing for bit-flip traffic, whereas p-cube routing saturates at a
lower rate than deterministic routing for uniform traffic.

In the results obtained for bit-reverse traffic, also shown in Figure 1, it is
observed that the generation rate for which the second routing scheme saturates is
greater than that for minimal routing. However, the difference between p-cube and
deterministic routing is less than that for bit-flip traffic. The reason why the
performance of the minimal routing scheme is so poor with bit-reverse traffic stems
from the inefficiency of the first routing scheme. Similar to bit-flip traffic, the first
routing scheme (not shown in this figure) causes all messages that are injected into a
group to exit that group through a single optical channel (the optical channel exiting a
node whose address is the bit-reverse of the group address). But this is not the case for
the second routing scheme where messages use the optical channels to exit their
source group evenly. With the second routing scheme used for bit-reverse traffic in an
OTIS-hypercube, the traffic within each group is also bit-reverse traffic. This explains
the superior performance of p-cube routing over deterministic routing, when the
second scheme is used.

With butterfly traffic, results of which are depicted in Figure 1, the performance of
minimal routing is unquestionably better than that of the second routing scheme. The
results of the first routing scheme are left out of this figure to preserve clarity.
These results have, however, shown the performance of the first routing scheme to be
very close to that of the minimal routing scheme. But the interesting point is that, with
the minimal scheme, there seems to be hardly any difference between the different
routing algorithms. This is due to the fact that with butterfly traffic, the Hamming
distance between the source and destination nodes of any message is equal to 2. It is
thus unsurprising that the degree of adaptivity with which those two hops are
traversed has almost no effect on the performance of the network. Another point is
that, since the Hamming distance between the source and destination nodes is so
small, the first routing scheme will almost always be selected by the minimal routing
scheme. This explains why, for butterfly traffic, the performance of minimal routing
is so close to that of the first routing scheme and why these two schemes are superior
to the second scheme.



4.2. Complement and Perfect-Shuffle Traffic Patterns

With complement traffic, hardly any difference can be observed between the
performance of minimal routing and that of the second routing scheme. This can be
observed in the results of Figure 2-a. But the first routing scheme saturates at a much
higher generation rate than the other two schemes. With the first routing scheme, the
length of the path from source to destination of a message is equal to the diameter of the
network. Therefore, the second routing scheme will never route messages through a
longer path than that of the first scheme. Thus with minimal routing, the second
scheme will always be selected. This explains why the minimal scheme and second
scheme perform equally. With a complement traffic pattern, all messages injected into
a specific group are destined to the same destination group (the address of which is
the complement of the source group address). Thus, with the second routing
scheme, all messages are routed to the same node in the source group, i.e. they all exit
the source group through the same optical channel. As a result, excessive traffic load
is imposed on some optical channels while others are left absolutely unused. Even the
traffic load on the electronic channels becomes unequally distributed. But this is not
the case for the first routing scheme, by which complement traffic is distributed
evenly over the optical channels. This explains why, as depicted in Figure 2-b, the
first routing scheme saturates at a much higher generation rate than that of the second
scheme, even though messages traverse a longer average distance with the first
scheme. In an OTIS-hypercube with complement traffic, there is also complement
traffic within each group when the first routing scheme is used.





(a) 




(b) 




Fig. 2: Average message latency of complement and perfect-shuffle traffic patterns for (a) low 
and (b) high generation rates. 



When the second routing scheme is used for perfect-shuffle traffic, all messages
injected into a sub-graph exit that sub-graph through one of two optical channels,
and the other optical channels exiting that sub-graph are left unused. This, as in
the case of complement traffic, results in the uneven distribution of traffic on
optical channels, and consequently the second routing scheme suffers from early
saturation. The traffic pattern within each group becomes somewhat similar to the
perfect-shuffle pattern; the only difference corresponds to the LSB of the
destination address. Results presented in [14] show that with perfect-shuffle
traffic in the hypercube, p-cube routing saturates at a lower generation rate than
that of deterministic routing. But unlike complement traffic,
the minimal routing of perfect-shuffle traffic does not always utilize the second
scheme. Nevertheless, the poor performance of the second routing scheme does affect
the performance of the minimal routing scheme. As a result, minimal routing saturates
at a generation rate only slightly higher than that of the second scheme, as revealed in
Figure 2-a.

In contrast to complement traffic, even the first routing scheme does not 
distribute perfect-shuffle traffic equally over the optical channels. Since the MSB 
(most significant bit) of the group address is rotated into the LSB (least significant 
bit) of the node address, the first routing scheme causes all messages of the same 
source group to exit that group through the optical channels of nodes with either even 
or odd addresses, depending on the MSB of the group address. Nonetheless, the first 
scheme does maintain superior performance over the second scheme. This can be 
observed in the results of Figure 2-b. The results obtained for adaptive intra-group 
routing based on the second routing scheme, shown in Figure 2-a, have been included 
in Figure 2-b once again to facilitate the comparison of the performance of the 
different routing schemes in this traffic pattern. 
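
The bit-level effect described above can be illustrated with a small sketch. It assumes, purely for illustration, that a node address is the concatenation of the group address (high bits) and the node address within the group (low bits), that perfect-shuffle traffic rotates the whole address left by one bit, and that complement traffic inverts every bit. The function names are hypothetical.

```python
def complement(addr: int, bits: int) -> int:
    """Complement traffic: the destination inverts every address bit."""
    return ~addr & ((1 << bits) - 1)


def perfect_shuffle(addr: int, bits: int) -> int:
    """Perfect-shuffle traffic: rotate the address left by one bit."""
    mask = (1 << bits) - 1
    return ((addr << 1) & mask) | (addr >> (bits - 1))


def split(addr: int, node_bits: int) -> tuple:
    """Split an address into (group, node), assuming group bits come first."""
    return addr >> node_bits, addr & ((1 << node_bits) - 1)


# Example with 2 group bits and 2 node bits (an assumed, tiny configuration).
src = 0b1001                       # group 0b10, node 0b01
dst = perfect_shuffle(src, 4)      # the MSB of the group becomes the LSB of the node
dst_group, dst_node = split(dst, 2)
```

In this toy configuration, every source in a group whose address has MSB 1 gets a destination node address with LSB 1, which is the parity effect the paragraph describes.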



5. Conclusions and Future Work 

A simulation-based evaluation of the performance of the OTIS-hypercube network 
has been conducted for three inter-group routing schemes that we have 
defined (the first, second and minimal routing schemes), three intra-group 
routing algorithms (deterministic, fully adaptive and partially adaptive routing) and 
six traffic patterns (uniform, complement, bit-reverse, bit-flip, butterfly and 
perfect-shuffle). We have shown that both the method of routing messages between 
different groups of the network (the inter-group routing scheme) and the intra-group 
routing algorithm have considerable influence on the performance of the 
OTIS-hypercube. However, we observe that (with the exception of uniform traffic) the 
inter-group routing scheme generally has a greater effect on performance than the 
intra-group routing algorithm. Traffic patterns have also been found to influence 
performance strongly. With bit-flip and bit-reverse traffic, the network 
saturates at higher generation rates when the second inter-group routing scheme is 
used, whereas the first routing scheme performs poorly. The 
converse holds for butterfly, complement and perfect-shuffle traffic. Minimal 
routing gives superior performance only with uniform traffic. 

Consideration of these characteristics can serve as a guideline to the optimal 
mapping of tasks to nodes by the operating system of such multiprocessor systems. 
Our next objective is to derive a mathematical performance model of wormhole 
routing in the OTIS-hypercube, and to validate its prediction accuracy using 
simulation experiments. 
Abstract. Paxson and Floyd (IEEE/ACM T. Netw., 1995) remarked on the limitation 
of fractional Gaussian noise (FGN) in accurately modeling LRD network 
traffic series. Beran (1994) suggested developing a sufficient class of parametric 
correlation forms for modeling the whole correlation structure of LRD series. M. 
Li (Electr. Letts., 2000) gave an empirical correlation form. This paper¹ extends 
Li's previous letter by analyzing it in Hilbert space and showing its flexibility 
in data modeling by comparing it with FGN (a commonly used traffic model). 
Verifications with real traffic suggest that the discussed correlation structure 
can be used to flexibly model LRD traffic series. 



1 Introduction 

Modeling long-range dependent (LRD) series has been widely studied, see e.g., [1]-[6], 
where the exactly self-similar (ess) process (i.e., fractional Gaussian noise (FGN)) is 
a commonly used tool, e.g., [1] [2] [5] [7]. However, in communication networks, the 
autocorrelation function (ACF) form of ess processes is too narrow for accurately 
modeling actual series [8]. On the other hand, accurate models of actual series are at 
the heart of some applications. For instance, accurate models of actual traffic series 
are crucial to performance evaluation of communication networks [9]. In addition, 
the ACF has an impact on queuing systems [10]. Motivated by these points, we extend Li's 
early work [6] on an empirically derived 3-parameter ACF form in Section 2. Verifications 
of this ACF form are given in Section 3 and conclusions in Section 4. 



1 This paper is in part sponsored by the Scientific Research Foundation for the Returned Overseas Chinese 
Scholars, State Education Ministry, PRC. 
2 Empirical 3-Parameter Correlation Form 

Let x be an LRD time series and r be its ACF. Then, $r(\tau) \sim c\tau^{-\beta}$ ($\tau \to \infty$), where $c > 0$ 
is a constant and $0 < \beta < 1$. We aim at finding a function $R(\tau)$ that best fits $r(\tau)$. 

Generally, an ACF is nonlinear. Thus, modeling a measured ACF can be regarded 
as a problem of nonlinear least squares fitting. If a model is characterized by several 
parameters, nonlinear least squares in multiple dimensions may result in a set of 
nonlinear equations. Since a set of nonlinear equations may have no (real) solutions [11], 
it is necessary to prove the existence of solutions. As numerical root finding for a set 
of nonlinear equations yields only approximate solutions, a criterion is 
needed to evaluate the quality of curve fitting. 

Denote a measured traffic trace as $x(t_i)$, indicating the number of bytes in a packet 
at time $t_i$, $i \in I_0$ (= 0, 1, 2, ...). Let $r$ be the measured ACF of $x$, $R$ be the model of $r$ and 
$M^2(R) = E[(R - r)^2]$ be the mean square error. Then, $M^2(R)$ is used to evaluate the 
quality of curve fitting. In our scheme, $M^2(R) < 10^{-[\ldots]}$ was required. Hence, our method 
to model $r$ is to find $R$ that fits $r$ under the constraint $M^2(R) < 10^{-[\ldots]}$. 

Let the error be $e = R - r$. Construct the functional below 

$J(e) = \|e\| = \|R - r\|.$ (1) 

Based on the experiments, we present the following normalized correlation form 

$R(k) = (|k| + 1)^{-a} + L\,u(|k| - m), \quad a > 0,\ 1 > L > 0,\ m = 1, 2,\ k \in I,$ (2) 

where $u$ is the unit step function. Consequently, $J(e)$ stands for a 3-D cost function 

$J(a, L, m) = J(e).$ (3) 

Due to the evenness of the ACF, we only consider $k > 0$ in what follows. An approximated 
root $(a_0, L_0, m_0)$ of $J = 0$ can be determined by iteration based on nonlinear 
least squares fitting for a given $r$. The existence of solutions is explained below. 

In fact, a measured traffic trace is of finite length. Without loss of generality, 
the maximum possible length of $x$ is assumed to be $p \in I_0$. Let $N \in I_0$ and $N \gg p$. Then, 
$N$ may be regarded as "infinite" in the engineering sense. Denote 

[…] $< \infty\}$ is a Hilbert space [5]. Denote 
$\mathcal{A}_1 = \{R;\ R(k) = c[(k + 1)^{-a} + L\,u(k - m)]\}$. Then, $\mathcal{A}_1 \subset \mathcal{A}$. 

Statement. Let $r \in \mathcal{A}$ be a measured autocorrelation sequence. There exists a 
unique element $R \in \mathcal{A}_1$ such that $\|r - R\| = \inf \|r - \tilde{R}\|$, where $\|r - R\| = J(e)$. 

Proof: $\mathcal{A}_1$ is obviously a convex set and $J(e)$ is a convex functional defined on $\mathcal{A}$. 

Therefore, the extremum of $J(e)$ exists. Thus, Statement follows. 
According to Statement, for a given $r \in \mathcal{A}$, if $R = R(k; a_0, L_0, m_0)$ is such that 
$\|r - R\| = \inf \|r - \tilde{R}\|$, then $(a_0, L_0, m_0)$ is called an approximated root of $J = 0$ and 
$(a_0, L_0, m_0) = \arg\min J(a, L, m)$. Thus, if $M^2(R) < 10^{-[\ldots]}$, $R(k; a_0, L_0, m_0)$ is acceptable in our scheme. 
In the paper, $(a_0, L_0, m_0)$ is obtained by the Levenberg-Marquardt method [11]. 
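
As a minimal sketch of the fitting idea, the following code evaluates the correlation form (2) and picks the parameter triple with the smallest mean square error. It is not the Levenberg-Marquardt procedure used in the paper: a plain grid search is swapped in to keep the example dependency-free. The convention u(0) = 1 for the unit step and all function names are assumptions made for illustration.

```python
def model_acf(k, a, L, m):
    """Correlation form (2): (|k| + 1)^(-a) + L * u(|k| - m), assuming u(0) = 1."""
    k = abs(k)
    step = 1.0 if k - m >= 0 else 0.0
    return (k + 1) ** (-a) + L * step


def mean_square_error(measured, a, L, m):
    """Mean of (R(k) - r(k))^2 over the lags k = 0 .. len(measured) - 1."""
    errors = [(model_acf(k, a, L, m) - r) ** 2 for k, r in enumerate(measured)]
    return sum(errors) / len(errors)


def grid_fit(measured, a_grid, L_grid, m_grid):
    """Return ((a0, L0, m0), error) minimizing the mean square error on a grid."""
    best_error, best_params = min(
        (mean_square_error(measured, a, L, m), (a, L, m))
        for a in a_grid
        for L in L_grid
        for m in m_grid
    )
    return best_params, best_error


# Synthetic "measured" ACF generated from known parameters (not real trace data).
synthetic = [model_acf(k, 2.0, 0.4, 1) for k in range(32)]
params, error = grid_fit(synthetic, [1.0, 1.5, 2.0, 2.5], [0.2, 0.3, 0.4], [1, 2])
```

A fit would then be accepted only if the returned error is below the threshold required in the scheme.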



3 Verifications 

Four well-known real-traffic traces (dec-pkt-1, dec-pkt-2, dec-pkt-3 and dec-pkt-4) 
are analyzed. Denote $R(k)$ as $R(k; a, L, m)$. Then, the cost function for dec-pkt-1 is 
given by $J(a, L, m) = \|R(k; a, L, m) - r_{pkt1}(k)\|$, where $r_{pkt1}(k)$ is the measured ACF of 
dec-pkt-1. By the Levenberg-Marquardt method, one obtains $(a_0, L_0, m_0) = (2.091, 0.377, 1)$. 
At this point, $M^2(R) = 1.952 \times 10^{-[\ldots]}$. Therefore, $r_{pkt1}(k)$ is modeled by 

$R(k) = (k + 1)^{-2.091} + 0.377\,u(k - 1).$ (4) 

Fig. 1 shows dec-pkt-1 and Fig. 2 shows the result of fitting the data for modeling $r_{pkt1}(k)$. 




Fig. 1. TCP trace of dec-pkt-1 

Fig. 2. The result of fitting the data: measured ACF and modeled ACF (solid line) 

Fig. 3. Fitting $r_{pkt1}(k)$ with the ess model: measured ACF and modeled ACF with FGN (solid line) 

Similarly, we have (2.088, 0.402, 1), (3.14, 0.341, 1), and (3.14, 0.341, 1) for $r_{pkt2}(k)$, 
$r_{pkt3}(k)$ and $r_{pkt4}(k)$, respectively. 

To evaluate the benefit of model (2), we use FGN to fit $r_{pkt1}(k)$. The normalized 
ACF of FGN is given by $R_{ess}(k; H) = 0.5[(k + 1)^{2H} - 2k^{2H} + (k - 1)^{2H}]$. By using least 
squares fitting, we have the result $R_{ess}(k; 0.93)$ with $M^2(R_{ess}) = 0.003$. As $R_{ess}(k; 0.93)$ is 
the best result in the ess sense, the benefit of our model is obvious, see Fig. 3. 
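
The single-parameter FGN fit can be sketched in the same spirit. The code below evaluates the normalized FGN ACF and selects the Hurst parameter with the smallest mean square error from a list of candidates. The candidate-list search and the function names are illustrative assumptions, not the paper's least squares routine, and the "measured" sequence is synthetic.

```python
def fgn_acf(k, hurst):
    """Normalized ACF of FGN: 0.5 * (|k+1|^(2H) - 2|k|^(2H) + |k-1|^(2H))."""
    k = abs(k)
    h2 = 2.0 * hurst
    return 0.5 * ((k + 1) ** h2 - 2.0 * k ** h2 + abs(k - 1) ** h2)


def fgn_mean_square_error(measured, hurst):
    """Mean of (R_ess(k; H) - r(k))^2 over the lags of the measured ACF."""
    errors = [(fgn_acf(k, hurst) - r) ** 2 for k, r in enumerate(measured)]
    return sum(errors) / len(errors)


def best_hurst(measured, candidates):
    """Pick the candidate H that minimizes the mean square error."""
    return min(candidates, key=lambda h: fgn_mean_square_error(measured, h))


# Synthetic ACF generated with H = 0.8, used only to exercise the functions.
synthetic = [fgn_acf(k, 0.8) for k in range(32)]
h_fit = best_hurst(synthetic, [0.6, 0.7, 0.8, 0.9])
```

Because the FGN form has only one free parameter, the shape of its ACF is fixed once H is chosen, which is why the 3-parameter form (2) can follow a measured ACF more closely.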



4 Conclusions 

A correlation form for modeling LRD traffic series has been given. The verifications 
show that it has noteworthy flexibility to model LRD traffic and satisfactorily fits 
the real traffic investigated. This model has an advantage over models based on a single 
parameter, such as the ess model. Because the modeled ACFs are non-summable, 
the long-range dependence of traffic has also been verified in this way. 
Abstract. Measurement of LRD traffic time series is the first stage of experimental 
research on traffic patterns. From a measurement point of view, if the length 
of a measured series is too short, an estimate of a specific objective (e.g., the 
autocorrelation function) may not achieve a given accuracy. On the other hand, if a 
measured series is too long, it takes too much storage space and costs too 
much computation time. Thus, a meaningful issue in measurement is how to determine 
the record length of an LRD traffic series for a given degree of accuracy 
of the estimate of interest. In this paper, we present a formula for the required 
record length of LRD traffic series according to a given bound on the accuracy 
of autocorrelation function estimation of fractional Gaussian noise and a given 
value of H. Further, we apply our approach to assessing some traces widely used 
in traffic research, giving a theoretical evaluation of those traces from 
the viewpoint of statistical error analysis. 



1 Introduction 

The Internet is a complex system, so conventional scientific computation is 
quite limited in performance research on the global Internet. Therefore, measurement 
plays a key role in such research, because measured data of real 
traffic reflect real-life conditions of the global Internet under 
current protocols and infrastructure. 

By analyzing measured data, findings regarding traffic were achieved in the last 
decade. In summary, 1) traffic is of long-range dependence (LRD), and 2) traffic is 
asymptotically self-similar [1]. The research of this paper will show that the particu- 
larity of LRD is also reflected in measurement. 

Recording traffic is the first stage of experimental research on traffic patterns. 
Here, we ask how to validate the reasonableness of measured traffic 
data. To explain this question, we ask another: is there another 
global Internet, superior to the one we are using, that could be 
used for measurement validation (e.g., data validation/assessment) in the standardization 
sense? Unfortunately, the answer is no. The global Internet is unique. 
In addition, simulating the Internet encounters severe difficulties [2]. For 
those reasons, conventional approaches to validation/assessment of measurement 
data in the field of measurement (e.g., [3]) fail for Internet traffic measurement. 
Hence, theoretical research on the measurement of LRD traffic is needed. 

When measuring a random sequence, it is important that the measured sequence 
be long enough to provide a sufficiently accurate estimate of an objective 
(e.g., the autocorrelation function (ACF)). In the field of measurement, however, 
length requirements of a measured random sequence are traditionally stated for sequences 
with short-range dependence (SRD), e.g., [4]. Intuitively, length requirements of LRD 
sequences should be distinctly different from those of SRD sequences because LRD 
processes evidently differ from SRD ones. However, to the best of our knowledge, we have 
not seen any reports about record length requirements for traffic measurement 
(except Li's early note [5]). This paper will show that the length requirement of a 
measured LRD sequence does differ drastically from that of an SRD one. Note that the 
result in this paper is based on ACF estimation of fractional Gaussian noise (FGN). 
However, the quantities to be considered in practice may not be the ACF of mono-fractal 
FGN but others, e.g., the Hurst function [6]. Therefore, the result in this paper may 
be conservative, but it may still be a reference guideline for the record length of traffic in 
academic research and practice. 

The rest of the paper is organized as follows. In Section 2, we present the formula for 
the required record length of measured LRD traffic with a given accuracy and a given 
value of H, based on ACF estimation of FGN. Discussions are given in Section 3 and 
conclusions in Section 4. 



2 Upper Bound of Standard Deviation 

Denote $x(i) = x(t_i)$ ($i$ = 0, 1, 2, ...) as a traffic trace, representing the number of bytes 
in a packet on a packet-by-packet basis at the time $t_i$. Mathematically, $x(i)$ is LRD if 
its ACF $r(k)$ is non-summable, while $x(i)$ is called asymptotically self-similar if $x(ai)$ 
($a > 0$) asymptotically has the same statistics as $x(i)$. 

In mathematics, the true ACF of $x(i)$ is computed over an infinite interval. However, 
any physically measured data sequence is finite in record length. Let a positive 
integer $L$ be the data block size of $x(i)$. Then, $r(k)$ is estimated over a finite interval. As 
is known, a useful (actually widely used) model of traffic is FGN [7] [8]. Its normalized 
ACF is given by $0.5[(k + 1)^{2H} - 2k^{2H} + (k - 1)^{2H}]$, where $H$ is the Hurst parameter. 

We take it as a representative of LRD traffic for our research on record length. 
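
To make "estimated over a finite interval" concrete, here is a minimal sketch of a normalized sample ACF computed from one block of L samples. The paper does not state which estimator it uses, so the biased sample estimator below, and the function name, are assumptions for illustration only.

```python
def sample_acf(x, max_lag):
    """Normalized sample ACF of a finite block x for lags 0 .. max_lag.

    Uses the biased estimator: every lag is divided by the lag-0 sum,
    so the result at lag 0 is exactly 1.
    """
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    acf = []
    for k in range(max_lag + 1):
        ck = sum((x[i] - mean) * (x[i + k] - mean) for i in range(n - k))
        acf.append(ck / c0)
    return acf


# Toy block of L = 8 samples (not traffic data).
block = [1, -1, 1, -1, 1, -1, 1, -1]
estimate = sample_acf(block, 2)
```

The estimate depends on the particular block, so it is a random variable whose spread shrinks as L grows, which is the quantity the theorem below bounds.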

Suppose $r(\tau)$ is the true ACF of FGN and $R(\tau)$ is its estimate with length $L$. Then, 
$R$ is a random variable. Let $M^2(R)$ be the mean square error in terms of $R(\tau)$. Then, 
$M^2(R) = \mathrm{Var}(R)$ [4]. We aim at finding a relationship that represents $M^2(R)$ as a 
two-dimensional function of $L$ and $H$ so as to establish a reference guideline for the required 
record length for a given degree of accuracy. We represent this relationship by the 
following theorem. 



Theorem. Let $x(t)$ be an FGN function with $H \in (0.5, 1)$. Let $r(\tau)$ be the true ACF of 
$x(t)$. Let $L$ be the block size of data. Let $R(\tau)$ be an estimate of $r(\tau)$ with length $L$. Let 
$\mathrm{Var}[R(\tau)]$ be the variance of $R(\tau)$. Then, 

$\mathrm{Var}[R(\tau)] \le$ […], 

where $\sigma^2$ is the variance of FGN. 

The proof of Theorem is omitted due to limited space. Without loss of generality, 
we consider $\sigma = 1$. Denote $s(L, H)$ as the bound of standard deviation in the 
normalized case. Then, one has 



$s(L, H)$ […] $\dfrac{1}{L(2H + 1)}$ […]. (2) 

Following (2), we see that $s(L, H)$ is an increasing function of $H$. 



3 Discussions 

From (2), it is seen that a large $L$ is required for a large value of $H$ (strong LRD) for a 
given $s$. In engineering, accuracy is usually considered from the perspective of order 
of magnitude. When $H$ = 0.55, 0.75 and 0.95, one has […] 0.118, 
$s(L, H)|_{[\ldots],\,H=0.75}$ […] 0.621. These show that the values of $L$ vary 
by orders of magnitude when $H$ = 0.55, 0.75 and 0.95 for a given $s$, implying that a series 
with a larger value of $H$ requires a larger $L$ for a given $s$. 

An exact value of $s(L, H)$ usually does not equal the real accuracy of the correlation 
estimation of a measured LRD-traffic sequence because FGN is only an asymptotic 
expression for real traffic [9] and traffic is multi-fractal in nature. On the other 
hand, there are errors in data transmission, data storage, measurement, numerical 
computations, and data processing. In addition, there are many factors causing errors 
and uncertainties due to natural shifts, e.g., various shifts occurring in devices, or 
some purposeful changes in communication systems. Therefore, the concrete accuracy 
value is not as pressing as the order of accuracy for the considerations in measurement 
design. For that reason, we emphasize that the contribution of $s(L, H)$ lies in that it 
provides a relationship between $s$, $L$ and $H$ as a reference guideline in the design 
stage of measurement. 

Table 1 lists some well-known traces on WAN. Now, we evaluate Lbl-pkt-4.TCP 
of length $1.3 \times 10^6$, which is the shortest one in Table 1. For $H$ = 0.90 (strong LRD) 
and $s$ being in the order of 0.1, we can select $L = 2^{[\ldots]}$. Because Theorem provides a 
conservative guideline due to the inequality used in the derivations and the assumption of 
the mono-fractal model of FGN, we verify that those traces are quite lengthy for ACF 
estimation as well as for general patterns/structures of traffic. 
Table 1. Six TCP packet traces 

Dataset          Date      Duration     Packets 
dec-pkt-1.TCP    08Mar95   10PM-11PM    3.3 million 
dec-pkt-2.TCP    09Mar95   2AM-3AM      3.9 million 
dec-pkt-3.TCP    09Mar95   10AM-11AM    4.3 million 
dec-pkt-4.TCP    09Mar95   2PM-3PM      5.7 million 
Lbl-pkt-4.TCP    21Jan94   2AM-3AM      1.3 million 
Lbl-pkt-5.TCP    28Jan94   2AM-3AM      1.3 million 



4 Conclusions 

We have derived a formula representing the accuracy of the correlation estimation of 
FGN as a 2-D function of the record length and the Hurst parameter. It may be 
conservative for real traffic but it may still serve as a reference guideline in measurement. 
Based on the present formula, the noteworthy difference between measuring LRD 
and SRD sequences has been pointed out. 
Abstract. A low-cost parallel QoS-sensitive domain architecture that supports load 
balancing is developed in this paper. To deal with the scaling 
problem, a large network is structured by grouping nodes into parallel 
domains. Unlike traditional approaches, especially the hierarchical structure, 
the parallel structure aggregates the topologies of domains in a new way. The 
corresponding routing algorithm, which adopts two techniques for low-cost routing, 
the QoS line segment and the swapping matrix, places particular emphasis on load 
balancing within networks. Finally, simulation results show appealing performance 
in terms of different metrics. 



1 Introduction 

QoS routing is the process of finding a path from the source node to 
the destination node that satisfies end-system performance requirements. The routing 
messages of a multi-metric QoS routing algorithm consume an enormous amount of 
network resources. For instance, an algorithm for finding a route with additive 
QoS constraints is proposed in [6], but the cost of broadcasting its routing messages is 
scarcely taken into account. Meanwhile, poor load balancing can 
cause considerable waste and partial or network-wide congestion. QoS routing based 
on a parallel-domain network, better than the traditional aggregation suggested in PNNI [1], 
effectively reduces bursts of unbalanced traffic load by improving the trade-offs between 
route computation and QoS provisioning and between route computation and load balancing. 



2 Network Architecture and Domain Models 

A domain is modeled as a tuple (N, M, E), where N is the set of nodes, M ⊆ N is the 
set of border nodes, and E is the set of physical links. The QoS parameter of a physical 
link is denoted as a QoS pair (D, B), in which D is the best delay of the link and B 
is the residual bandwidth. If a path is made up of n-1 physical links, each of which is 
characterized by a QoS pair $(D_i, B_i)$, the QoS pair of a physical path is denoted as: 
When there exist m (m > 1) paths between two nodes, a QoS pair set 
like $\{(D_1, B_1), (D_2, B_2), \ldots, (D_m, B_m)\}$ must be used to denote QoS information between 
two nodes. A network, denoted as (G, L), consists of a set of domains and joint links, 
where $G = \{g_i = (N_i, M_i, E_i),\ 1 \le i \le |G|\}$. L is the set of inter-domain links, each of 
which is denoted in the same way as the intra-domain links in a domain. 
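
The domain model above can be illustrated with a short sketch. It assumes the usual composition rule for bandwidth-delay routing, namely that delay is additive along a path and that the residual bandwidth of a path is that of its bottleneck link. This rule is an assumption of the sketch, since the path formula itself is not legible in the text, and the names are hypothetical.

```python
from typing import List, Tuple

QosPair = Tuple[float, float]  # (delay D, residual bandwidth B)


def path_qos(links: List[QosPair]) -> QosPair:
    """QoS pair of a path, assuming additive delay and bottleneck bandwidth."""
    total_delay = sum(delay for delay, _ in links)
    bottleneck = min(bandwidth for _, bandwidth in links)
    return total_delay, bottleneck


def qos_pair_set(paths: List[List[QosPair]]) -> List[QosPair]:
    """One QoS pair per alternative path between the same two nodes."""
    return [path_qos(links) for links in paths]


# Two alternative paths between the same pair of nodes (made-up numbers).
pairs = qos_pair_set([
    [(2, 100), (5, 60), (3, 80)],
    [(4, 120), (9, 90)],
])
```

With more than one path, neither pair need dominate the other (one path may be faster, the other wider), which is why a set of pairs rather than a single pair is kept.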



3 QoS Sensitive Aggregations 

Fig. 1 shows two line segments of a physical path between two nodes, denoted by 
[low_left, high_right], at times $T_1$ and $T_2$. Different line segments would have different 
crankback and err-denial. Similarly, the QoS approximation of a logical inter-domain 
path can also be denoted by a line segment such as $l_{path}$, which must include 
three essential parts: the originating node and domain, the outgoing/incoming border 
node and domain, and the current incoming/outgoing border node and domain. 



Fig. 1. A line segment is generated by the least squares method applied to its QoS pair set. 
The line segment may shift over time. Route requests in the unshaded 
area between the line segment and the staircase cannot be served at all; this is called 
crankback and is caused by the distortion of the line segment. Meanwhile, the line segment 
rejects feasible requests because it does not cover some areas that belong to the 
staircase; this is called err-denial. 
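
A minimal sketch of how such a line segment could be obtained: an ordinary least squares line is fitted through the (delay, bandwidth) pairs and evaluated at the smallest and largest delay to give the low_left and high_right endpoints. The choice of endpoints and the function name are assumptions made for illustration; the paper does not spell them out.

```python
def fit_line_segment(pairs):
    """Least squares line through (delay, bandwidth) pairs.

    Returns (low_left, high_right): the fitted line evaluated at the
    smallest and largest delay in the set. Needs at least two distinct delays.
    """
    n = len(pairs)
    mean_d = sum(d for d, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    sxx = sum((d - mean_d) ** 2 for d, _ in pairs)
    sxy = sum((d - mean_d) * (b - mean_b) for d, b in pairs)
    slope = sxy / sxx
    intercept = mean_b - slope * mean_d
    d_min = min(d for d, _ in pairs)
    d_max = max(d for d, _ in pairs)
    low_left = (d_min, slope * d_min + intercept)
    high_right = (d_max, slope * d_max + intercept)
    return low_left, high_right


# Made-up QoS pair set: larger delay budgets admit more bandwidth.
segment = fit_line_segment([(2, 10), (4, 20), (6, 30)])
```

Storing two endpoints instead of the whole staircase is what makes the aggregation cheap, at the price of the crankback and err-denial regions described in the caption.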

A line segment can be evaluated for its approximate residual traffic load on the 
path. For a line segment $[(D_0, B_0), (D_1, B_1)]$, its residual traffic load is defined as: 



There are in total X outgoing border nodes w, x, ..., z within domain Y adjacent to 
domain Z and node i is the current incoming border node within domain Z, or there 
are in total X incoming border nodes w, x, ..., z within domain Y and node i is the 
current outgoing border node within domain Y. When a route request originates from 
node o within domain X, the swapping matrix for node i is defined as follows: 



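$$M_i = \begin{bmatrix} l_{owi} \\ l_{oxi} \\ \vdots \\ l_{ozi} \end{bmatrix} = \begin{bmatrix} [(D_{low\_left}, B_{low\_left}), (D_{high\_right}, B_{high\_right})]_{owi} \\ [(D_{low\_left}, B_{low\_left}), (D_{high\_right}, B_{high\_right})]_{oxi} \\ \vdots \\ [(D_{low\_left}, B_{low\_left}), (D_{high\_right}, B_{high\_right})]_{ozi} \end{bmatrix}_{X \times 1}$$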



4 Line-Segment and Balance-Swapping Routing Algorithm 



A problem with two constraints called "shortest weight-constrained path" was listed 
in [4]. Furthermore, as a four-metric routing algorithm, LBRA is close in nature to the 
centralized bandwidth-delay routing algorithm (CBDRA) in [3]. Like [5], LBRA 
prefers the source routing mechanism to the hop-by-hop routing mechanism because it 
needs the topological border information to analyze resource allocations. There are 
two conjoint levels in LBRA, the inter-domain routing level and the intra-domain 
routing level. The intra-domain routing level is analogous to CBDRA if residual traffic 
load is not taken into account. Suppose that there is a route request from node o with 
the QoS pair $(d_{req}, b_{req})$. Let $D_i, D_{i+1}, \ldots, D_{i+n}$ be the estimated sums of delay from the 
source node to outgoing border nodes $i, i+1, \ldots, i+n$ within the same domain, in which 
incoming node k does not lie. The link between node i and node k is an inter-domain 
link with QoS pair $(D_{ik}, B_{ik})$. So are the links between nodes $i+1, \ldots, i+n$ and node k. 



    j = 0; ... = 0;
    do {
        if (B_ik < b_req) break;
        else {
            ...
        }
        use (D_ik, B_ik) and ... to form swapping matrix M;
        ...
        i++;
    } while (n+1 times)
    B'_path = ...;
    if (B'_path > ...)
        D_Q = ...;
    if ((D_Q < ...) && (... <= d_req)) {
        D_k = D_Q; ...
    }

An outgoing node runs LBRA in a similar way to an incoming node, except that the 
outgoing node uses […] and […] to form a swapping matrix. 



5 Simulation Results 

The simulated network consists of 9 domains with a total of 253 nodes. The number 
of borders varies from 4 to 6. All nodes are connected by directed links and each 
node is connected to at least 2 other nodes in the same domain. The delay of each link 
is between 2 ms and 15 ms and the bandwidth ranges from 640K to 6.4M Bytes/s. Fig. 2 
reflects routing success ratio w.r.t. bandwidth and link utilization. 



[Figure 2. Left panel: curves for LSRA, LBRA and the Dijkstra algorithm, horizontal axis 
Bandwidth (kBytes/sec). Right panel: horizontal axis Time (s), legend entries include 
"Peak at LSRA" and "Average at LSRA".] 

Fig. 2. In the left sub-figure, LBRA outperforms the Dijkstra algorithm and LSRA (as in [2]) in 
routing success ratio w.r.t. bandwidth. In the right sub-figure, LBRA does not cause 
any congestion and keeps a smooth trend of link utilization on the random link, while LSRA 
fails to prevent bursts of unbalanced traffic load. 



6 Conclusions 

In the paper, I use line segments and a swapping matrix to achieve a load-balancing 
QoS-sensitive network and develop LBRA, which can find the most appropriate QoS 
route. Simulations show stable and high performance across different metrics. 



References 

1. The ATM Forum: Private Network-to-Network Interface Specification Version 1.0 (PNNI 1.0). 
af-pnni-0055.000 (1996) 

2. King-Shan Lui, Klara Nahrstedt: Topology Aggregation and Routing in Bandwidth-delay 
Sensitive Networks. In Proceedings of GLOBECOM '00. IEEE (2000) 

3. Z. Wang, J. Crowcroft: Quality-of-service Routing for Supporting Multimedia Applications. 
In IEEE Journal on Selected Areas in Communications. Vol. 14. (1996) 

4. M. R. Garey, D. S. Johnson: Computers and Intractability - A Guide to the Theory of 
NP-Completeness. Freeman, California, USA (1979) 

5. D. Estrin, Y. Rekhter, and S. Hotz: Scalable Inter-Domain Routing Architecture. In 
Proceedings of ACM SIGCOMM '92, Maryland (1992) 

6. Turgay Korkmaz, Marwan Krunz, and Spyros Tragoudas: An Efficient Algorithm for Finding 
a Path Subject to Two Additive Constraints. In Computer Communications Journal, Vol. 
25. (2002) 225-238 



Reversible Cellular Automata Based Encryption 



Marcin Seredynski’’^, Krzysztof Pienkosz’ and Pascal Bouvry^ 

* Warsaw University of Technology 
Nowowiejska 15/19, 00-665 Warsaw, Poland 
seredynski@acn.waw.pl, K.Pienkosz@ia.pw.edu.pl 
^ Luxembourg University of Applied Sciences 
6, rue Coudenhove Kalergi, L-1359, Luxembourg-Kirchberg, Luxembourg 
pascal.bouvry@univ.lu 

^ Polish- Japanese Institute of Information Technology, Research Center 
Koszykowa 86, 02-008 Warsaw, Poland 



Abstract. In this paper cellular automata (CA) are applied to construct a symmetric-key 
encryption algorithm. A new block cipher based on one-dimensional, 
uniform and reversible CA is proposed. A class of CA with rules specifically 
constructed to be reversible is used. The algorithm uses a 224-bit key. It 
is shown that the algorithm satisfies a security criterion called the Strict Avalanche 
Criterion. Due to a huge key space, a brute-force attack appears practically 
impossible. 



1 Introduction 

The increased use of computers has resulted in a strong demand for means to protect 
information and to provide various security services. Encryption is a primary method 
of protecting valuable electronic information. It transforms the message (plaintext) 
into the ciphertext. The opposite operation is called decryption. Two forms of 
encryption are in common use. They are called symmetric-key and public-key 
encryption [3]. If both sender and receiver use the same key, or it is easy to obtain one 
from another, then the system is referred to as symmetric-key. If the sender and receiver 
each use a different key, and it is computationally infeasible to determine one from 
another without knowing some additional secret information, then the system is 
referred to as public-key. There are two classes of symmetric-key encryption 
algorithms. They are called block and stream ciphers. A block cipher breaks up the 
message into blocks of fixed length and encrypts one block at a time. A stream cipher 
encrypts a data stream one bit or one byte at a time. This paper deals with 
symmetric-key block ciphers. 

In this paper we apply CAs to construct a new symmetric-key algorithm. CAs are 
highly parallel and distributed systems, which are able to perform complex 
computations. They have been used so far in both symmetric-key and public-key 
cryptography. A CA-based public-key cipher was proposed by Guan [1]. A stream CA-based 
encryption algorithm was first proposed by Wolfram [9], and later other algorithms were 
developed by Tomassini et al. [6] and recently by Seredynski et al. [4]. A block cipher 
using both reversible and irreversible rules was proposed by Gutowitz [2]. 

This paper presents a new encryption algorithm based on a class of reversible CA 
with rules specially designed to be reversible. It is organized as follows. The next 
section defines elementary and reversible CA. Section 3 presents the idea of how a 
particular reversible CA class can be used for encryption. A detailed description of the 
encryption algorithm based on reversible CAs and its analysis can be found in Section 
4. Section 5 concludes the paper. 



2 Cellular Automata 



2.1 Elementary Cellular Automata 



One-dimensional CA is an array of cells. Each cell is assigned a value over some state 
alphabet. CA is defined by five parameters: size, initial state, neighborhood, rule and 
boundary conditions. Size defines the number of cells. Each cell updates its value 
synchronously in discrete time steps according to some rule. Such a rule is defined over 
the neighborhood, which is typically composed of the cell itself and its r left and r right 
neighbors: 

$s_i^{t+1} = R(s_{i-r}^t, \ldots, s_{i-1}^t, s_i^t, s_{i+1}^t, \ldots, s_{i+r}^t),$ (1) 

where $s_i^t$ is the value (state) of the i-th cell in step t and r is the radius of the neighborhood. 
An example of a rule definition for a radius 1 neighborhood is shown in Fig. 1. 



[Figure 1: the eight radius 1 neighborhood configurations at step t and the resulting 
state of the central cell at step t+1.] 

R = 0*2^7 + 1*2^6 + 1*2^5 + 0*2^4 + 1*2^3 + 0*2^2 + 0*2^1 + 1*2^0 = 105 

Fig. 1. Elementary rule 105 definition 



There are 8 possible neighborhood configurations for a radius 1 neighborhood. As 
shown in Fig. 1, a state transition must be defined for each possible case. 

When dealing with finite CA, cyclic boundary conditions are usually applied, 
which means that the CA can be treated as a ring. Changing the values of all cells in step t 
is called a CA iteration. Before the first iteration can take place, some initial values must 
be assigned to all cells. This is called the initial state of the CA. By updating the values in all 
cells, the initial state is transformed into a new configuration. When each cell updates 
its state according to the same rule, the CA is said to be uniform. Otherwise the CA is 
called non-uniform. In this paper one-dimensional, uniform CA defined over a binary 
state alphabet with neighborhood size two and three is used. 
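
The definitions above can be condensed into a short sketch of one iteration of a uniform, binary, radius 1 CA with cyclic boundary conditions. The rule number is read as a lookup table in the way Fig. 1 shows, with the neighborhood (left, center, right) taken as a 3-bit index. The function name is an illustrative choice.

```python
def elementary_step(cells, rule):
    """One synchronous iteration of a radius 1 binary CA on a ring.

    The neighborhood (left, center, right) is read as a 3-bit number;
    the new state is the bit of `rule` at that position.
    """
    n = len(cells)
    new_cells = []
    for i in range(n):
        left = cells[(i - 1) % n]      # cyclic boundary: index -1 wraps around
        center = cells[i]
        right = cells[(i + 1) % n]
        index = (left << 2) | (center << 1) | right
        new_cells.append((rule >> index) & 1)
    return new_cells


initial_state = [0, 0, 1, 0, 0]
after_rule_150 = elementary_step(initial_state, 150)
after_rule_105 = elementary_step(initial_state, 105)
```

Rule 150 is the XOR of the three neighborhood cells and rule 105 is its bitwise complement, so the two results above differ in every cell.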
2.2 Reversible Cellular Automata 



When analyzing elementary CA it turns out that only a small number of rules have 
the property of being reversible. For example, among all 256 radius 1 CA only six are 
reversible. This is why a class of CA with rules specially created to be reversible is 
considered. Different reversible CA classes are presented in [5]. In this paper we 
use the reversible CA class presented by Wolfram [8]. In this class a rule depends not on 
one but on two steps back: 



$s_i^{t+1} = R(s_{i-r}^t, \ldots, s_i^t, \ldots, s_{i+r}^t, s_i^{t-1}).$ (2) 



In the elementary CA, the value of the i-th cell in configuration t+1 depends on the 
state of the cell itself and of r of its neighbors on each side in configuration t. In this 
reversible class an additional dependency is added. Now, the value of the central cell 
$s_i^{t-1}$ in step t-1 is also considered. Such a rule can be simply constructed by taking an 
elementary CA rule and adding a dependency on two steps back (t-1). An example of such a 
rule definition is shown in Fig. 2. 



[Figure 2: definition for t-1 = 1 given by elementary rule 105 (R1), and definition for 
t-1 = 0 given by elementary rule 150 (R2).] 

Fig. 2. Reversible rule 105/150 definition 



The definition of the rule is now composed of two elementary rules. The first one 
defines the state transition when the central cell was in state 1 in step t-1, and the 
second one when that cell was in state 0. Fig. 2 gives an example of a reversible 
rule based on two elementary rules: 105 and 150. These two rules are complementary 
to each other. Knowing one value, it is possible to calculate the second one using the 
following formula: 

$$R_2 = 2^{d} - R_1 - 1, \qquad (3)$$

where $d = 2^{2r+1}$. The same rule is used in forward and backward iteration. The 
total number of radius r rules is $2^{d}$. Since a reversible rule now depends on two 
steps back, the CA initial state must be composed of two successive configurations 
labeled $q^0$ and $q^1$. 
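
The construction above can be illustrated with a minimal Python sketch. It is not the authors' implementation: cyclic boundary conditions and Wolfram's rule numbering are assumptions made here, and the helper names `complement_rule` and `step` are hypothetical. The sketch iterates rule 105/150 forward and then backward with the same rule, recovering the two initial configurations.

```python
def complement_rule(r1: int, radius: int) -> int:
    """Return R2 = 2^d - R1 - 1 with d = 2^(2*radius + 1), following Eq. (3)."""
    d = 2 ** (2 * radius + 1)
    return 2 ** d - r1 - 1


def step(prev, cur, r1, radius):
    """Compute configuration t+1 from configurations t-1 (prev) and t (cur).

    Assumptions (not stated in the text): cyclic boundary conditions and
    Wolfram's numbering, where the neighbourhood read as a binary number
    (leftmost cell = most significant bit) selects one bit of the rule.
    """
    n = len(cur)
    r2 = complement_rule(r1, radius)
    nxt = []
    for i in range(n):
        idx = 0
        for k in range(-radius, radius + 1):
            idx = (idx << 1) | cur[(i + k) % n]
        rule = r1 if prev[i] == 1 else r2  # R1 when the cell was 1 at t-1
        nxt.append((rule >> idx) & 1)
    return nxt


q0 = [0, 1, 1, 0, 1, 0, 0, 1]
q1 = [1, 0, 0, 1, 1, 1, 0, 0]

# Forward iteration for 10 steps.
prev, cur = q0, q1
for _ in range(10):
    prev, cur = cur, step(prev, cur, 105, 1)

# Backward iteration: swap the last two configurations, reuse the same rule.
prev, cur = cur, prev
for _ in range(10):
    prev, cur = cur, step(prev, cur, 105, 1)

print(complement_rule(105, 1))   # 150
print(cur == q0 and prev == q1)  # True
```

Because the two elementary rules are complementary, the update is equivalent to XOR-ing the output of one elementary rule with the cell's value at t-1, which is why the same rule undoes itself when the two configurations are swapped.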
3 The Idea of Using Reversible CA Class in a Block Cipher 

When using the reversible CA described in the previous section, the plaintext is encoded 
as a part of the CA initial state: configuration $q^0$. Configuration $q^1$ is set up with 
random data. Encryption is done by forward iteration of the CA for a fixed number of steps 
according to some reversible rule. This rule forms a secret key. The encryption algorithm 
is shown in Fig. 3. 




Fig. 3. Encryption using reversible cellular automata 



Configuration $q^{n-1}$ is the ciphertext. There are two options concerning configuration 
$q^{n}$ (called final data) generated by the encryption. The most secure one assumes that 
this information is kept secret, which means that configuration $q^{n}$ together with the 
rule forms the key. The disadvantage of this option is that the key changes with each 
encryption, because the key is now a function of the rule, the plaintext and the random 
initial data. In the second option, the final configuration $q^{n}$ is encrypted using the 
Vernam cipher [3]. This is done by applying the exclusive-OR operation (XOR, $\oplus$) to 
the final configuration $q^{n}$ and the key. The encrypted final configuration then no longer 
has to be kept secret and can be added to the ciphertext. To obtain the final data for 
decryption, the XOR operation must be applied to the encrypted final configuration and 
the key. 

For the decryption, the same operations are applied in reverse order. CA initial 
state is composed of the final data and the ciphertext. CA is then iterated for the same 
number of steps as used for encryption. The same secret rule taken from the key is 
used. 
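
The scheme of this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than the cipher itself: boundaries are taken to be cyclic, the plaintext is placed in the earlier of the two initial configurations, and the Vernam key is passed as a separate hypothetical parameter `mask`. The names `encrypt`, `decrypt` and `xor_bits` are illustrative only.

```python
import secrets


def step(prev, cur, rule, radius):
    """One step of the second-order reversible CA (cyclic boundary assumed)."""
    n = len(cur)
    comp = 2 ** (2 ** (2 * radius + 1)) - rule - 1  # complementary rule
    out = []
    for i in range(n):
        idx = 0
        for k in range(-radius, radius + 1):
            idx = (idx << 1) | cur[(i + k) % n]
        chosen = rule if prev[i] else comp
        out.append((chosen >> idx) & 1)
    return out


def xor_bits(a, b):
    return [x ^ y for x, y in zip(a, b)]


def encrypt(plaintext, rule, radius, steps, mask):
    """Return (ciphertext, XOR-masked final data)."""
    prev, cur = plaintext, [secrets.randbits(1) for _ in plaintext]
    for _ in range(steps):
        prev, cur = cur, step(prev, cur, rule, radius)
    return prev, xor_bits(cur, mask)


def decrypt(ciphertext, masked_final, rule, radius, steps, mask):
    """Start from (final data, ciphertext) and iterate the same rule."""
    prev, cur = xor_bits(masked_final, mask), ciphertext
    for _ in range(steps):
        prev, cur = cur, step(prev, cur, rule, radius)
    return cur


rule = 0x9A3C5E71  # hypothetical 32-bit radius 2 rule
mask = [secrets.randbits(1) for _ in range(32)]
plain = [secrets.randbits(1) for _ in range(32)]
cipher, masked = encrypt(plain, rule, 2, 22, mask)
print(decrypt(cipher, masked, rule, 2, 22, mask) == plain)  # True
```

Encrypting the same plaintext twice gives different ciphertexts in this sketch because the second initial configuration is drawn at random on every call.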

A good encryption algorithm should satisfy the Strict Avalanche Criterion (SAC) 
[7]. This means that each output bit should change with a probability of one half 
whenever a single input bit is complemented. In this paper, we use 32-cell 
radius 2 CAs and 64-cell radius 3 CAs. In order to satisfy SAC, 32-cell radius 2 
CAs should be iterated for at least 22 iterations, while 64-cell radius 3 CAs should be 
iterated for at least 19 iterations. These results are based on 10000 experiments for 
each parameter set (CA size/rule size/iteration number). In each experiment a 
randomly generated rule and initial configuration were used. 
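
An experiment of this kind could be sketched as below. The sketch is only an assumption about how such a measurement might be set up: it flips one randomly chosen bit of the first initial configuration, compares both final configurations, and averages the fraction of changed output bits over random rules. The function names and the choice of which bits are flipped and observed are hypothetical; values close to 0.5 would be consistent with SAC.

```python
import random


def step(prev, cur, rule, radius):
    """One step of the second-order reversible CA (cyclic boundary assumed)."""
    n = len(cur)
    comp = 2 ** (2 ** (2 * radius + 1)) - rule - 1
    out = []
    for i in range(n):
        idx = 0
        for k in range(-radius, radius + 1):
            idx = (idx << 1) | cur[(i + k) % n]
        out.append(((rule if prev[i] else comp) >> idx) & 1)
    return out


def run(q0, q1, rule, radius, steps):
    """Iterate and return the last two configurations concatenated."""
    prev, cur = q0, q1
    for _ in range(steps):
        prev, cur = cur, step(prev, cur, rule, radius)
    return prev + cur


def avalanche_rate(cells, radius, steps, trials, seed=0):
    """Average fraction of output bits that change when one input bit flips."""
    rng = random.Random(seed)
    flipped = 0
    for _ in range(trials):
        rule = rng.getrandbits(2 ** (2 * radius + 1))
        q0 = [rng.getrandbits(1) for _ in range(cells)]
        q1 = [rng.getrandbits(1) for _ in range(cells)]
        q0_flipped = list(q0)
        q0_flipped[rng.randrange(cells)] ^= 1
        a = run(q0, q1, rule, radius, steps)
        b = run(q0_flipped, q1, rule, radius, steps)
        flipped += sum(x != y for x, y in zip(a, b))
    return flipped / (trials * 2 * cells)


print(avalanche_rate(cells=32, radius=2, steps=22, trials=200))
```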
4 Cipher Based on Reversible Cellular Automata 

Our cipher is composed of four one-dimensional CA labeled CA32L, CA32R, CA64, 
and CAs. Both CA32L and CA32R are composed of 32 cells. CA64 is composed of 64 
cells and CAs is composed of 16 cells. There are two inputs to the encryption 
function: the plaintext and the key. The plaintext is divided into 64-bit blocks. The key 
is 224 bits in length (see Sec. 4.2). Automata CA32L, CA32R and CAs use radius 2 
reversible rules and CA64 uses a radius 3 reversible rule. The encryption of a single 
plaintext block consists of 16 rounds (a typical number for block ciphers). Each round is 
composed of 4 operations: iterations in CA64, CA32L and CA32R, and the Shift 
transformation. 



4.1 Details of a Single Round 

Fig. 4 presents details of a single round of encryption. 




Fig. 4. Single round of the algorithm 



Each round starts with two 64-bit values labeled $q_{0init}$ and $q_{1init}$ that form the 
initial state of CA64. In the first round of encryption, $q_{0init}$ is the plaintext to be 
encrypted. Configuration $q_{1init}$, called initial data, is described later. In rounds 2-16, 
both configurations get their values from the result of the preceding round. After being 
initialized, CA64 is iterated for 19 steps. The result is divided into four 32-bit values. 
Two of these values are then shifted, one to the left and one to the right, by ns cell 
positions (described later). The shifted values together with the two remaining values 
form the initial states of CA32L and CA32R. Next, both CA are iterated for 
22 steps. The result forms two 64-bit values labeled $q_{0final}$ and $q_{1final}$. These values 
become the input of the next round: $q_{0final}$ becomes $q_{0init}$ and $q_{1final}$ becomes 
$q_{1init}$. After the last round, $q_{0final}$ is a 64-bit ciphertext block. 

The encryption of successive blocks is shown in Fig. 5. 







Fig. 5. Successive blocks encryption scheme 



For the encryption of the first plaintext block, configuration $q_{1init}$ (initial data) is 
generated randomly. Otherwise, $q_{1final}$ from the encryption of the previous block is 
used as $q_{1init}$. Configuration $q_{1final}$ (called final data) obtained from the encryption 
of the last plaintext block is encrypted (see Sec. 4.2) and then added to the 
ciphertext. 

The values used for the Shift operation are generated by CAs. Before the first block is 
encrypted, CAs is initialized with random values. Then, whenever a new ns value 
is needed, CAs is iterated for 5 steps. The consecutive values of the central cell form 
the 5-bit ns value. A new ns is generated for each round. The final configuration of CAs 
is encrypted (see Sec. 4.2) and added to the ciphertext. 

Decryption uses the same operations in reverse order. The only difference is 
that the selected output from CA32L is shifted right and the output from CA32R is shifted 
left. 
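
The generation of ns and the Shift transformation can be sketched as follows. Several details are assumptions made for illustration and are not given in the text: the shift is treated as a cyclic rotation, the central cell is taken at index 8 of the 16-cell automaton, and the five collected bits are assembled most significant bit first. The names `next_shift`, `rotl32` and `rotr32` are hypothetical.

```python
import random


def step(prev, cur, rule, radius):
    """One step of the second-order reversible CA (cyclic boundary assumed)."""
    n = len(cur)
    comp = 2 ** (2 ** (2 * radius + 1)) - rule - 1
    out = []
    for i in range(n):
        idx = 0
        for k in range(-radius, radius + 1):
            idx = (idx << 1) | cur[(i + k) % n]
        out.append(((rule if prev[i] else comp) >> idx) & 1)
    return out


def next_shift(prev, cur, rule):
    """Iterate the 16-cell radius 2 CA for 5 steps; return (ns, prev, cur)."""
    ns = 0
    center = len(cur) // 2  # assumed position of the central cell
    for _ in range(5):
        prev, cur = cur, step(prev, cur, rule, 2)
        ns = (ns << 1) | cur[center]
    return ns, prev, cur


def rotl32(x, n):
    """Cyclic left shift of a 32-bit value (cyclic shift is an assumption)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def rotr32(x, n):
    """Cyclic right shift, the inverse of rotl32 for the same n."""
    return rotl32(x, (32 - n) % 32)


rng = random.Random(1)
prev = [rng.getrandbits(1) for _ in range(16)]
cur = [rng.getrandbits(1) for _ in range(16)]
rule = rng.getrandbits(32)
ns, prev, cur = next_shift(prev, cur, rule)
x = 0x12345678
print(rotr32(rotl32(x, ns), ns) == x)  # True
```

With this choice a left shift applied during encryption is undone by a right shift with the same ns during decryption, which matches the exchange of shift directions described above.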
4.2 Key Structure 

Each automaton runs with its own reversible rule. The key is composed of 4 rules, 
one for each automaton, in the following order: CA32L, CA32R, CA64, CAs. 
There are three radius 2 rules (CA32L, CA32R, CAs) and one radius 3 rule (CA64). 
Each radius 2 rule is 32 bits in length while the radius 3 rule is 128 bits in length. This 
gives a key size of 224 bits. The final data configuration $q_{1final}$ is encrypted using the 
XOR operation and bits 0-63 of the key, and the final configuration of CAs is encrypted 
using the same operation and bits 192-223. The key should be generated using a 
high quality random number generator. 
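
The key layout can be illustrated with a short sketch. It assumes that key bits are numbered in the order in which the rules are listed, with bit 0 as the most significant bit of a 224-bit integer; this numbering and the name `split_key` are assumptions of the sketch, not statements from the text.

```python
import secrets

KEY_BITS = 224


def split_key(key: int):
    """Split a 224-bit key into rules and XOR masks (bit 0 = MSB, assumed)."""
    def field(first, last):
        width = last - first + 1
        shift = KEY_BITS - 1 - last
        return (key >> shift) & ((1 << width) - 1)

    return {
        "CA32L": field(0, 31),              # radius 2 rule, 32 bits
        "CA32R": field(32, 63),             # radius 2 rule, 32 bits
        "CA64": field(64, 191),             # radius 3 rule, 128 bits
        "CAs": field(192, 223),             # radius 2 rule, 32 bits
        "final_data_mask": field(0, 63),    # XORed with the 64-bit final data
        "cas_state_mask": field(192, 223),  # XORed with the final state of CAs
    }


key = secrets.randbits(KEY_BITS)
parts = split_key(key)

final_data = secrets.randbits(64)
masked = final_data ^ parts["final_data_mask"]
print(masked ^ parts["final_data_mask"] == final_data)  # True
```

Under this numbering, bits 0-63 coincide with the rules of CA32L and CA32R, and bits 192-223 coincide with the rule of CAs.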



4.3 Cryptanalysis 

There are $2^{224}$ possible keys, which means that a brute-force attack appears 
practically impossible. 

Another attack consists in trying to find the final states of CA32L, CA32R and CAs. 
Indeed, knowledge of two successive configurations of a CA makes finding its 
rule much easier. For CA32L and CA32R the final state is composed of the last 
ciphertext block and the final data configuration. The last ciphertext block is known, so 
we only need to find two 32-bit final data configurations. For CAs the final state is 
composed of the last two 16-bit configurations. Knowledge of the final state of CAs 
enables finding the values used in the Shift transformation. Since the final states of 
CA32L, CA32R and CAs were encrypted using the XOR operation with random data (the 
key), they can only be found by enumeration. There are $2^{96}$ possible values of the final 
data configurations of CA32L and CA32R and the final state of CAs. We also need to know 
the rule used for CA64. There is no information on any configuration used in CAs, which 
means that there is no way of telling which rule was used. The described attack requires 
verifying $2^{224}$ possible combinations of “final states of CA32L/CA32R/CAs/CA64 rule” 
($2^{32} \cdot 2^{32} \cdot 2^{32} \cdot 2^{128}$). The complexity of this task is the same as that 
of a brute-force attack. 
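
The counting argument can be checked with a few lines of arithmetic. The trial rate used at the end is a purely hypothetical figure chosen for illustration.

```python
# Two 32-bit final data configurations plus the 32-bit final state of CAs.
final_states = 2**32 * 2**32 * 2**32
ca64_rules = 2**128  # a radius 3 rule is 128 bits long

combinations = final_states * ca64_rules
print(final_states == 2**96)    # True
print(combinations == 2**224)   # True

# Hypothetical attacker testing 10**12 combinations per second.
seconds_per_year = 365 * 24 * 3600
years = combinations // (10**12 * seconds_per_year)
print(f"about {years:.2e} years")
```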

Greater security can be achieved by using larger block size but it reduces encryp- 
tion/decryption speed. 



4.4 Cipher Properties 

Our reversible CA-based cipher works in a mode that is similar to CBC mode [3] in 
terms of the achieved result. A plaintext block that appears in the whole plaintext 
more than once produces different blocks of ciphertext. This is because the encryption 
of each plaintext block starts with initial data taken from the encryption of the 
previous block (each time a different value). In DES-like ciphers there is still a problem 
with encrypting the same plaintext more than once, or when two encrypted plaintexts 
begin with the same information (in both cases the same key is used). In the first 
case the same ciphertext will be produced, while in the second case both plaintexts 
will be encrypted the same way until the first difference is reached. It is possible to 
overcome this problem by encrypting some random data block first. In the proposed 
cipher, encrypting the same plaintext with the same key will always result in a different 
ciphertext. This is achieved by using randomly generated data and a new initial 
configuration of automaton CAs for the encryption of each plaintext. 



5 Conclusions 

In this paper we have presented an idea on how a particular class of reversible CAs 
can be used in cryptography. We have given an example of a new block encryption 
algorithm based on that idea. One dimensional, radius 2 and radius 3 CAs are used. 
The algorithm uses 64-bit blocks as an input. The same operations are used for both 
encryption and decryption. Strict Avalanche Criterion is satisfied. Due to a huge key 
space a brute-force attack appears practically impossible. 
Abstract. As malicious intrusions span sites more frequently, network 
security plays a vital role in the Internet. Intrusion detection systems (IDSs) 
are expected to provide powerful protection against malicious behavior. 
However, high false negative and false positive rates prevent intrusion detection 
systems from being used in practice. After a survey of present intrusion detection 
systems, we believe that more accurate and efficient detection results 
can be obtained by using multi-sensor cooperative detection. To aid 
cooperative detection, an ontology consisting of attribute nodes and 
value nodes is presented after an analysis of IDS rules and various classes 
of computer intrusions. On the basis of the ontology, a matchmaking method 
is given to improve the flexibility of detection. A cooperative detection framework 
based on the ontology is also discussed. The ontology proposed in the 
paper has two advantages. First, it makes the detection more flexible and, 
second, it provides global locality information to support cooperation. 



1 Introduction 

Since intrusion detection was introduced in the mid-1980s [1], intrusion detection 
systems (IDSs) have been developed for about two decades to enhance computer 
security. However, high false negative and false positive rates prevent ID systems from 
being used in practice. After analyzing the reasons for the high false alarm rate, we think 
that inefficient detection is partly caused by insufficient audit data sources and 
the lack of cooperation among data from multiple sensors. Many IDSs depend on only 
one kind of source: network data or host-based data. However, many intrusions show 
characteristics in both of these data sources. If data from more sensors can be utilized 
to perform intrusion detection, the alerts will be more accurate. 

The key problem is how to correlate multi-sensor information in order to evaluate 
the security state of the monitored system. In this paper, we propose a cooperative 
detection framework based on an ontology. An ontology can provide a detection system 
with the ability to share a common conceptual understanding and can provide 
relationships among heterogeneous audit data. Based on the ontology we present our 
cooperative framework to correlate the information from multiple sensors. A flexible 
and efficient matching algorithm is also given to perform detection. 
The remainder of the paper is organized as follows: Section 2 presents related 
work on cooperative detection using multiple sensors. In Section 3 we present the 
ontology established from host and network data features, and give the matchmaking 
algorithm and the ontology based cooperating function. Some experimental results 
are presented in Section 4. In Section 5, we conclude our work and discuss future 
research. 



2 Related Work 

Some IDSs have used both host and network data to perform detection. 
DIDS [3] accepts the notable event records from each of the host and LAN 
monitors and sends them to the expert system. The expert system is responsible 
for evaluating and reporting on the security state of the monitored system. The 
detection model is the basis of the rule base and consists of 6 layers, each layer 
representing the result of a transformation performed on the data. EMERALD [4] 
is a scalable distributed intrusion detection system that operates on three 
different levels in a large enterprise network. EMERALD introduces a recursive 
framework for coordinating analyses from distributed monitors to provide global 
detection. However, neither of these two systems addresses data sharing between 
host and network data. 

In [5] Peng points out that most intrusions are not isolated but related as 
different stages of an attack sequence, with the early stages preparing for the later 
ones. The proposed approach correlates alerts using prerequisites of intrusions. It can 
discover high-level intrusion scenarios and reduce the impact of false alerts, but 
the detection rate is constrained by the low-level IDSs. Frincke [6] proposes 
principles for a framework to support cooperative intrusion detection across 
multiple domains and describes a prototype cooperative data sharing system 
illustrating many of those principles. 

An ontology is an explicit specification of concepts and relationships and 
is widely used in many research areas, such as the semantic web, knowledge 
management and AI. There are few reports in the literature about applying ontologies 
to IDSs. In [7] Pinkston gives a target-centric ontology to describe the concepts 
within the ID domain and the relations between them. They implement their ontology 
model in the DAML+OIL language, but they do not seem to make full use of 
correlation and inheritance relationships to perform detection. 



3 The Ontology Based Cooperative Detection 

3.1 Ontology 

The term ontology means a specification of a conceptualization. An ontology is 
a description (like a formal specification of a program) of the concepts and 
relationships in a specific domain. The purpose of using an ontology in an intrusion 
detection system is to provide powerful constructs that can guide cooperating detectors 
in exchanging machine interpretable messages. Understanding the messages of other 
cooperating detectors and their descriptions of current status is vital in cooperative 
work. We designed an ontology after analyzing some IDS rules and the security 
vulnerabilities published by Common Vulnerabilities and Exposures (CVE) [8]. Compared 
to the ontology provided by [7], our ontology focuses on the features that can be observed 
by each sensor. We do not intend to estimate the motivation of the intruder as [7] has 
done. 

The complete ontology includes two kinds of nodes: value nodes and attribute 
nodes. Attribute nodes describe all the features that can be observed by the 
sensors, and value nodes are the children of attribute nodes and represent 
the possible values of their parent attribute node. Fig. 1 illustrates a part of our 
ontology which only has attribute nodes. The complete ontology is not given 
because it would make the illustration clumsy. At the root of the ontology is the attack 
signature. The subclasses of the root are attribute nodes that represent different 
features from heterogeneous sensors including host sensors, network sensors, router 
logs etc. A higher level node of the ontology means a more abstract feature and is 
the locality of the lower level nodes. From the ontology, we can learn on which 
sensor the information of interest can be found. For example, if we require the total 
memory usage information, we can learn from the ontology that this value can be 
obtained from the system status sensor on the host. This is useful in the cooperative 
detection process because it helps us to locate the required information. 




Fig. 1. The Overview of Ontology 
3.2 Ontology Based Matching 

Signature-based intrusion detection systems often use string matching or a more 
powerful expert system to perform matchmaking. A string matching system, such as 
Snort [9,10], performs simple substring matching of the characters in the text. Such a 
mechanism is of course not very flexible. If the attack signature changes 
a little, the system will miss it. For example, a backdoor server uses port ’666’ 
to connect with its client. If the server changes the communication port to ’555’, the 
string matching system will not detect it until another rule is added to the rule base. 
Expert systems are more powerful and flexible; however, their execution efficiency 
suffers from the more complex mechanism. In our approach we give a matchmaking 
method based on the above ontology that is somewhat similar to the powerful expert 
system but has high execution efficiency. 

Each node in the above ontology (in Fig. 1) is a feature node that shows an 
attribute of the collected information. The complete ontology also includes 
the value nodes, which show the value distribution. The parent of a value node is 
the attribute that has this value. If the attribute has a continuous value, some 
discretizing preprocessing should be applied. In the example given above, the intruder 
changes the backdoor program's communication port from the suspicious ’666’ to a seldom 
used port ’555’ and establishes the connection from an unknown foreign address 
rather than the known suspicious one. To illustrate this scenario, a rule node 
’Rule1: Step 1/2’ and the observed suspicious data node ’intrusion’ are attached 
to some value nodes in the ontology shown in Fig. 2. ’Step 1/2’ indicates that 
matchmaking for this rule has two steps and that this is the first step; it will be 
explained in detail in the next section. Although the observed phenomenon does 
not exactly match the rule, we know it is still the same backdoor intrusion. 




Fig. 2. Ontology based Matchmaking 
To evaluate the relationship and similarities among the rules and data, each 
edge between a value node and its parent node is assigned a weight. The weight, 
from 0 to 1, shows the relationship among the different values that have the same 
parent node. A weight of 1 means that two value nodes have the maximum similarity, 
while 0 means the minimum. For example, the ’Port’ node has the values 
’Suspicious’, ’Seldom used’ and ’Frequent used’, and we store the weights of node 
N in a vector $V_N(w_1, w_2, \ldots, w_n)$, where $w_i$ is the weight between N and its brother 
node i. In the above example, the three vectors are $V_{suspicious}(1, 0.8, 0.1)$, 
$V_{seldom}(0.8, 1, 0.5)$ and $V_{frequent}(0.1, 0.5, 1)$; the 0.8 in $V_{seldom}(0.8, 1, 0.5)$ 
means that the weight between ’Suspicious’ and ’Seldom used’ is 0.8. The weights are 
assigned empirically by an expert and will be adjusted according to feedback from the 
results. An autonomous adaptive weight assignment method has more advantages [11] and 
will be our future work. The weight between ’Suspicious’ and ’Seldom used’ is 0.8 
because a seldom used port is often utilized by intruders to perform malicious 
communication. However, the weight between ’Suspicious’ and ’Frequent used’ is only 
0.1, as a frequently used port is less likely to be used by an intruder. 

There exists a set of paths, denoted path_set($N_R$, $N_I$), from node 
’Rule’ ($N_R$) to node ’Intrusion’ ($N_I$), which is used to evaluate the similarity 
between the two nodes. It is not necessary to traverse all paths between $N_R$ 
and $N_I$; only paths between two attached nodes that have the same attribute 
parent node are considered. In the above example, only three paths are considered: 
the path from ’666’ to ’555’, the path from ’Suspicious’ to ’Unknown’ and the path from 
’TCP’ to ’TCP’. If $N_R$ and $N_I$ are attached to the same node in one path, 
the detector gives the maximum score 1. If not, the similarity score depends 
on the path weight between the two nodes $N_i$, $N_j$ to which they are attached 
separately. The total score is then calculated by averaging all the path similarity 
scores. 



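$$\mathrm{path\_simi\_score}_{path(i,j)} = V_i(w_1, w_2, \ldots, w_n) \cdot (0_1, 0_2, \ldots, e_j, \ldots, 0_n)', \quad e_j = 1 \qquad (1)$$

$$\mathrm{total\_score}(N_R, N_I) = \frac{\sum_{path(i,j) \in path\_set(N_R, N_I)} \mathrm{path\_simi\_score}_{path(i,j)}}{\mathrm{Count}(path\_set(N_R, N_I))} \qquad (2)$$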



The total score in the above example is: 

$$\mathrm{total\_score} = \frac{(0.8, 1) \cdot (1, 0)' + (0.8, 1) \cdot (1, 0)' + 1}{3} = 0.8667$$
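
The worked example can be reproduced with a minimal sketch. Instead of the vector product of Eq. (1), the sketch looks the pairwise weight up in a dictionary, which selects the same entry; the dictionary layout and the node labels are assumptions made for illustration.

```python
def path_simi_score(weights, i, j):
    """Weight between value nodes i and j; 1 when both attach to the same node."""
    if i == j:
        return 1.0
    return weights[frozenset((i, j))]


def total_score(weights, paths):
    """Average of the path similarity scores, as in Eq. (2)."""
    scores = [path_simi_score(weights, i, j) for i, j in paths]
    return sum(scores) / len(scores)


# Pairwise weights taken from the worked example in the text.
weights = {
    frozenset(("Suspicious port", "Seldom used port")): 0.8,
    frozenset(("Suspicious address", "Unknown address")): 0.8,
}

paths = [
    ("Suspicious port", "Seldom used port"),    # rule port '666', data port '555'
    ("Suspicious address", "Unknown address"),  # rule address vs. observed address
    ("TCP", "TCP"),                             # same node on both sides
]

print(round(total_score(weights, paths), 4))  # 0.8667
```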



If the total_score exceeds the threshold, the observed data matches the 
rule and is a suspected intrusion. During the intrusion detection process, many 
rules are loaded and attached to the ontology. The similarity score of each audit 
data item against the relevant rules is evaluated by the total_score function. To make 
the matchmaking more efficient, we do not need to calculate the total_score between 
the audit data node $N_{data}$ and all rules. One a priori fact is that if the total_score 
exceeds the threshold, at least one path_simi_score belonging to the same path 
set exceeds the threshold. Using this fact, we filter the rules before 
calculating the total_score function, which influences the execution speed the most. 
The matchmaking algorithm is given below. 

Boolean: IntrusionMatchMaking(Var RuleSet: Rst; AuditData Node: Ndata) 
{ 
    Queue: Q; 

    // AttachedNodeSet gives all the nodes that Ndata is attached to 
    FOR each rule R in Rst 
        FOR each node N in AttachedNodeSet(Ndata) 
            // If one path_simi_score exceeds the threshold, 
            // rule R is inserted into the queue 
            IF path_simi_score(path(N, R)) > Threshold THEN 
                insert(Q, R); 
                BREAK;    // Quit inner loop, go on with the next rule 
        END FOR 
    END FOR 

    // Calculate the total score of Ndata and each rule in Q 
    FOR each R in Q 
        IF total_score(Ndata, R) > Threshold THEN RETURN TRUE; 
    END FOR 
    RETURN FALSE; 
} 
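
The algorithm above can also be written as a short runnable sketch. The data layout is an assumption: a rule and an audit record are both represented as dictionaries mapping an attribute to a value node, and the threshold of 0.8 is a hypothetical value, since the text does not state one.

```python
def pair_weight(weights, a, b):
    """Similarity of two value nodes under the same attribute (0 if unknown)."""
    return 1.0 if a == b else weights.get(frozenset((a, b)), 0.0)


def passes_filter(weights, rule, data, threshold):
    """Pre-filter: at least one path score must exceed the threshold."""
    return any(pair_weight(weights, rule[k], data[k]) > threshold
               for k in rule if k in data)


def total_score(weights, rule, data):
    scores = [pair_weight(weights, rule[k], data[k]) for k in rule if k in data]
    return sum(scores) / len(scores) if scores else 0.0


def intrusion_match(weights, rules, data, threshold):
    queue = [r for r in rules if passes_filter(weights, r, data, threshold)]
    return any(total_score(weights, r, data) > threshold for r in queue)


weights = {
    frozenset(("Suspicious", "Seldom used")): 0.8,
    frozenset(("Suspicious", "Frequent used")): 0.1,
    frozenset(("Suspicious", "Unknown")): 0.8,
}
rules = [{"port": "Suspicious", "address": "Suspicious", "protocol": "TCP"}]

observed = {"port": "Seldom used", "address": "Unknown", "protocol": "TCP"}
benign = {"port": "Frequent used", "address": "Known", "protocol": "UDP"}

print(intrusion_match(weights, rules, observed, threshold=0.8))  # True
print(intrusion_match(weights, rules, benign, threshold=0.8))    # False
```

The filter is safe because the total score is an average of the path scores: if no single path score exceeds the threshold, the average cannot exceed it either.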



3.3 Applying the Ontology to Cooperative Detection 

The relationship between value nodes and attribute nodes supports a 
more accurate and efficient matchmaking method for signature detection. At the 
same time, the relationship between the attribute nodes provides fundamental 
information for cooperative detection. The parent node indicates the 
locality of its child nodes. For example, if the total memory usage information 
is wanted during the cooperative detection process, the ontology tells the detector 
to obtain this information from the system status sensor on the host. The ontology 
provides machine interpretable knowledge for the cooperative detection process. 

To give more accurate alerts, detectors try to collect as much information as 
possible before drawing a conclusion. Rules become more complicated when the 
signature information in a rule is distributed over different sensors. For example, 
a complete rule in Fig. 1 is composed of three sub-rules. When the detector finishes 
the first detection on the network sensors, it examines the connection status on the 
host. The ontology tells the detector where to get the desired information, and 
the detector then sends a query request to the system status monitor on the host. 
Before beginning the third step, the detector again learns from the ontology 
about the locality of the application logs. This scenario is common in cooperative 
detection. If the backdoor intruder communicates with the backdoor server using 
encrypted commands, the network sensor can only detect a connection using a 
strange port and is unable to decrypt the content of the connection. So the decrypted 
commands can only be obtained from the application logs on the host. 

When facing a multi-step rule, the detectors perform the sub-rule detections 
individually and each detection result is stored temporarily in a local database. Each 
sub-alert has a TTL (Time to Live) tag. When the sub-alert expires, it is 
deleted from the database. When the last sub-rule detection step is finished, the 
detector cooperates and queries each sub-rule detection result in the relevant local 
database. 
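
The handling of sub-alerts can be sketched as follows. This is an illustration only: the class `SubAlertStore`, the function `rule_fires` and the use of explicit timestamps in seconds are assumptions, and an in-memory dictionary stands in for the local database.

```python
class SubAlertStore:
    """Stand-in for one detector's local database of sub-alerts."""

    def __init__(self):
        self._alerts = {}  # (rule_id, step) -> expiry time

    def add(self, rule_id, step, now, ttl):
        self._alerts[(rule_id, step)] = now + ttl

    def purge(self, now):
        """Delete every sub-alert whose TTL has run out."""
        self._alerts = {k: t for k, t in self._alerts.items() if t > now}

    def has(self, rule_id, step, now):
        self.purge(now)
        return (rule_id, step) in self._alerts


def rule_fires(stores, rule_id, steps, now):
    """True when every sub-rule step still has a live sub-alert."""
    return all(stores[s].has(rule_id, s, now) for s in steps)


# Step 1 is detected on the network sensor, step 2 on the host sensor.
stores = {1: SubAlertStore(), 2: SubAlertStore()}
stores[1].add("Rule1", 1, now=0, ttl=30)
stores[2].add("Rule1", 2, now=10, ttl=30)

print(rule_fires(stores, "Rule1", [1, 2], now=20))  # True
print(rule_fires(stores, "Rule1", [1, 2], now=35))  # False, step 1 expired
```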

4 Experiment 

An experiment was designed to test our approach. The edges and nodes of the 
ontology are stored in a relational database. Because of infrastructure limitations, 
we only implemented cooperative detection between the host and the network. A 
WinPcap based tool was used to collect network packets, and Strace for NT [12] was used 
to trace NT system calls on the Windows 2000 Server operating system. Some host logs, 
such as memory usage and network connection status, are also utilized in 
detection. Based on the rules of Snort and on CVE, we designed 21 rules, which are 
mostly R2L (Remote to Local) and U2R (User to Root) detection rules, because 
R2L and U2R intrusions are not easy to detect while our cooperative approach 
is suitable for detecting these intrusions by using both host and network audit 
data. Most of these rules are two-step rules including a sub-rule on the network and 
one on the host. We then ran the intrusion tool Netbus to evaluate our intrusion 
detection prototype. During the simulation we changed the communication port of Netbus 
to test our ontology-based matchmaking algorithm. The default port for Netbus is 
20034, and we changed it to 20038 and to 80 in separate runs. We also ran 
Snort 2.1.1 to detect the simulated intrusions and compared its detection rate with 
that of our prototype. 



Table 1. Detection Results Comparison between Snort and Our Prototype 

Netbus         Detected by   Detected by   False Alerts   False Alerts 
               Snort         Prototype     by Snort       by Prototype 
Default Port   Y             Y 
Port: 20038    N             Y             38             12 
Port: 80       N             N 

Table 1 compares the detection results and false alerts of Snort and our 
prototype. The experimental results show the superiority of our prototype over 
Snort. From the result in the second row, we can see that our approach is more 
flexible than Snort because the ontology-based matchmaking algorithm is employed 
in our approach. However, if the attack is changed too much, our method 
shows its limitation. Because of the cooperation of the network and host detectors, 
the false alarm rate decreases in our method. More complicated experiments 
are being designed, covering, for example, intruder communication using encrypted 
commands and new intrusions evolved from old ones. We expect our prototype to show 
a more conspicuous improvement in these cases. 
5 Conclusion and Ongoing Efforts 

In this paper, we present an ontology to describe the relationships among features 
observed by multiple sensors. There are two kinds of nodes in the ontology: value 
nodes and attribute nodes. By assigning weights to the edges between value 
nodes and their parent attribute node, we provide a more flexible matchmaking 
method for intrusion detection. At the same time, the relationship between 
attribute nodes and their parents can indicate the locality of the desired information. 
An ontology based cooperative detection function is also given in the paper. 

We will employ more sensors to perform multi-sensor cooperative detection in 
our prototype, and more complicated experiments will be designed to evaluate our 
approach. Our future work will focus on improving the ontology in the intrusion 
detection domain and the ontology based matchmaking and cooperation methods. 
We also intend to design an autonomous adaptive method to 
adjust the weights using feedback from the detection results. 
Abstract. With the increasing need for security, cryptographic processing 
becomes a crucial issue for network devices. Traditionally, security functions are 
implemented with Application Specific Integrated Circuits (ASICs) or General-Purpose 
Processors (GPPs). Network processors (NPs) are emerging as a programmable 
alternative to the conventional solutions to deliver high performance 
and flexibility at moderate cost. This work compares and analyzes the architectural 
characteristics of many widespread cryptographic algorithms on the Intel 
IXP2800 network processor. In addition, we investigate several implementation 
and optimization principles that can improve cryptographic performance on 
NPs. Most of the results reported here should be applicable to other network 
processors since they have similar components and architectures. 



1 Introduction 

Information security is an indispensable concern owing to the growing demands for 
trusted communication and electronic commerce. For example, a collection of 
applications such as secure IP (IPSEC) and virtual private networks (VPNs) has been 
widely deployed in nodal processing. However, cryptographic algorithms are all 
computationally intensive [1]. To address this problem and add security functions to 
network equipment, such as secure gateways, a straightforward approach to achieve 
comparable performance is to implement them in hardware. Unfortunately, many 
security chips or coprocessors are only designed for a few algorithms, while most 
Internet security standards are written to allow flexibility in algorithm selection. In 
addition, cryptographic hardware is not cheap or readily exportable. 

On the other hand, network processors (NPs) are an emerging class of programmable 
processors used as a building block for implementing packet processing applications 
such as switches and routers [2]. They are highly optimized for packet processing 
and I/O operations. As demands for communication security grow, cryptographic 
processing becomes another application domain. Currently, there are 
two approaches to adding security to NPs: 1) Security functionality is built directly 
into the same silicon as the network processor. Nevertheless, this method is still 
inflexible when multiple algorithms have to be implemented. 2) Cryptographic 
applications are implemented on NPs in software, which provides a good trade-off 
between performance and flexibility. Compared to other similar approaches, such as 
software implementation on general-purpose processors (GPPs), this approach has the 
following advantages: a) NPs utilize system-on-chip (SoC) technology and have a better 
performance-price ratio than GPPs. b) NPs often involve multi-thread and multi-core 
architectures, thus various kinds of parallelism can be exploited to boost performance. 
So, in this paper, we focus on software implementation on NPs. 

Clearly, the most challenging work for software implementations is to provide 
some performance guarantees. Most recent studies [2][3] related to cryptographic 
issues on NPs assume a symmetric multiprocessor (SMP) or superscalar 
architecture with multi-level caches, which is more similar to GPPs and ignores 
many characteristics of real-life NPs, such as hardware multi-threading and 
asynchronous I/O. This work aims to study the architectural properties of several 
widespread cryptographic algorithms on an actual platform, the Intel IXP2800 network 
processor. Implementation and optimization principles for them are also proposed. The 
rest of this article is organized as follows. In section 2 we briefly review the 
architecture of the IXP2800. Then, we detail the selection of cryptographic algorithms 
and their characteristics in section 3. Next, we propose several optimization principles 
and illustrate the results through benchmarks in section 4. Finally, we summarize this 
work and offer some suggestions for network processor designs. 



2 Architecture of Intel IXP2800 
Fig. 1. The hardware architecture of Intel IXP2800 



Closely examining the hardware architecture of the IXP2800 shown in Fig. 1 helps to
elucidate our implementation and optimization. The IXP2800 is a 32-bit RISC-based
multi-core system that exploits system-on-chip (SoC) techniques for deep packet
inspection, traffic management and forwarding at high speed. The 700 MHz XScale
core is a general-purpose processor used for control plane tasks (slow-path
processing). The sixteen 1.4 GHz microengines (MEs) are data plane PEs, which are
connected in two clusters. The IXP2800 has a distributed, shared memory hierarchy that
supports two types of external memory: RDRAM and QDR SRAM. In addition, the
processor includes a 16KB on-chip Scratch SRAM shared among all MEs, and each ME has
plenty of registers in conjunction with a small amount of local memory. Memory
access latencies have not kept pace with ME processing speed. For instance, the
minimum read latency of the fastest shared SRAM (Scratch) is 100 ME cycles. To solve
this problem, the IXP architecture uses 8 “zero thread switching overhead” hardware
threads per ME for interleaved operation: one thread does computation while the others
block waiting for memory operations to complete.
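
The benefit of interleaving threads can be estimated with a simple model. The following Python sketch is an illustration only and is not part of the original study: it assumes that every thread alternates between a fixed number of compute cycles and one memory reference, and that thread switches are free. The function names and the 20-cycle compute figure are hypothetical; only the 100-cycle Scratch latency and the 8 threads per ME come from the text.

```python
def me_utilization(threads: int, compute_cycles: int, latency_cycles: int) -> float:
    """Fraction of cycles in which the ME does useful work (idealized model)."""
    if threads < 1 or compute_cycles <= 0 or latency_cycles < 0:
        raise ValueError("invalid parameters")
    # One thread is busy for compute_cycles out of every compute+latency cycles;
    # additional threads fill the waiting time until the ME saturates at 100%.
    return min(1.0, threads * compute_cycles / (compute_cycles + latency_cycles))


def threads_to_hide(compute_cycles: int, latency_cycles: int) -> int:
    """Smallest thread count that keeps the ME fully busy in this model."""
    if compute_cycles <= 0 or latency_cycles < 0:
        raise ValueError("invalid parameters")
    total = compute_cycles + latency_cycles
    return -(-total // compute_cycles)  # ceiling division


if __name__ == "__main__":
    # 100-cycle latency is from the text; 20 compute cycles is a made-up figure.
    for n in (1, 4, 8):
        print(n, round(me_utilization(n, 20, 100), 2))
    print(threads_to_hide(20, 100))
```

Under these assumptions a single thread keeps the ME busy only about 17% of the time, whereas six or more threads would hide the latency completely. Real workloads issue references irregularly, so this is at best a first-order estimate.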



3 Selection of Cryptographic Algorithms 

There are three application domains for cryptographic processing: public-key
ciphers, private-key ciphers and hash functions. In this article, public-key ciphers are
not studied, since many of them are not practical to implement on the
fast path of an NP. First, they require large code storage. Besides, public-key ciphers
are usually used for short sessions and private-key management, while private-key
ciphers are critical for long-session performance. Therefore, we focus our
effort on private-key ciphers and hash functions. The former can be further classified
into block ciphers and stream ciphers. Of the many algorithms, we select a subset of
10 algorithms based on their representativeness, popularity and availability. The
summaries and characteristics of these algorithms are presented in Table 1.



Type          | Name         | Block Size | Rounds | Table Size | Special Requirements                           | Description or Applications
              |              | (bits)     |        | (bytes)    |                                                |
--------------+--------------+------------+--------+------------+------------------------------------------------+-----------------------------------------
Block Cipher  | DES [4]      | 64         | 16     | 256        |                                                | The first commercial-grade modern cipher
              | AES [5]      | 128        | 10     | 5120       |                                                | 802.11i
              | IDEA [6]     | 64         | 9      | 0          | Multiply unit                                  | PGP, SSH/SSL
              | RC5 [7]      | 64         | 16     | 136        | 32-bit variable rotation engine                | Wireless Transport Layer Security in WAP
              | RC6 [8]      | 128        | 20     | 176        | Multiply unit, 32-bit variable rotation engine | AES candidate, improved version of RC5
              | Blowfish [9] | 64         | 16     | 4168       |                                                | Norton Utilities
Stream Cipher | RC4 [10]     | -          | -      | 256        |                                                | SSL/TLS, 802.1x
              | SEAL [11]    | -          | 64+2   | <4096 *    |                                                | Disk encryption
Hash Function | MD5 [12]     | 512        | 64     | 0          |                                                | Digital Signature
              | SHA-1 [13]   | 512        | 80     | 0          |                                                |

* The table size of SEAL varies with the output length. The upper bound is listed here.

Table 1. Selection and characteristics of cryptographic algorithms



4 Optimization and Benchmark 

4.1 Methodology 

To observe the architectural characteristics of cryptographic algorithms, the utilization
of internal resources, and the performance bottlenecks, we conduct our experiments
under Workbench 3.1, which is a cycle-accurate simulator of the IXP2800. We
configure the MEs to run at 1.4 GHz, SRAM at 200 MHz and RDRAM at 400 MHz. All
the source code is compiled using the Intel Microengine C compiler 3.1 with optimization
level -O2 enabled. When we encounter operations, such as rotation, that cannot be
directly expressed using C operators but are supported by IXP instructions, we implement
them with inline assembly code. Apart from these cases, our optimization principles do not
focus on specific instructions unique to one target but on general features applicable to a
wide range of NPs. In addition, to test the scalability of parallel optimization we
exploit up to 8 MEs (64 threads in total) in one ME cluster, as described earlier.



4.2 Instruction Characteristics

Fig. 2. Raw code sizes and instruction mix

In this section we present experimental statistics on the instruction distribution of
these algorithms. These metrics are essential for understanding their
dynamic properties and developing implementation and optimization principles.
Figure 2 illustrates the instruction mix profile and code size of all selected algorithms.
The following observations summarize their instruction patterns and major differences:

♦ Most block ciphers and stream ciphers need little code storage (fewer than 200
lines of code). The only exception is DES, because it has several complex bit
operations. Hash functions usually need more code storage.

♦ The most frequently used instructions are ALU instructions, especially simple
ALU instructions like add, shift and logic. As a whole, ALU instructions occupy
a significant share of the total instruction mix, 79.9% on average.

♦ Branch instructions are rarely used in any algorithm. The average percentage is
1.5% (0.8% for unconditional branches and 0.7% for conditional branches).

♦ For memory and load-immediate instructions, there are significant differences
among the selected algorithms. Stream ciphers and some block ciphers (AES and
Blowfish) tend to have a relatively high percentage of memory instructions
(exceeding 15%) compared with hash functions. The average percentage of memory
instructions across the 10 algorithms is only 4%.



4.3 Optimization Principles and Benchmarks 

We describe our optimization and benchmarks in two subsections. The first
focuses on general implementation and optimization principles for a single thread within
one ME. The second considers multiple threads and performs scalability tests with
multiple MEs.

Optimizations for single thread 

Generic implementation and optimization rules 

These rules are NP-independent, and most of them are also suitable for optimizations
on GPPs. Their goal is to minimize the overall computational complexity and reduce
the cost of expensive operations.

♦ Take full advantage of the rich register resources and distributed memory hierarchy:
To minimize access latencies, frequently used tables should be placed in
registers and per-ME local memories as much as possible.

♦ Avoid using complex instructions: Instructions like multiplication, which
consume more than one cycle, should be avoided.

♦ Pre-calculate part of the algorithms: Aside from table initialization and key
scheduling mentioned earlier, immediate data used in inner loops can also be pre-loaded.

♦ Unroll loops: This can prevent pipeline flushes and save extra clock cycles.
Besides, unrolling loops can reduce calculations on iteration variables
and make array addressing more efficient.

NP-dependent memory optimizations 

These principles make use of the specially optimized memory and I/O units on NPs to
increase the ME utilization rate and stretch the computation capacity to the utmost.

♦ Align memory operations: Accessing data smaller than the size supported by
the hardware incurs overhead (e.g. table accesses in RC4 and DES). Thus, to
achieve optimal performance, tables should be aligned at hardware boundaries.

♦ Memory burst read and write: On most NPs, memory burst operations can be
issued directly at the instruction level. The IXP2800 allows a 32-byte Scratch/SRAM or
64-byte RDRAM burst reference within one instruction. Employing this further
reduces the number of memory instructions. In our benchmark, reading plaintext and
writing back ciphertext are always done in bursts of the block size.

♦ Latency hiding and I/O parallelization: This makes use of asynchronous memory
operations to ‘hide’ long memory access latencies and improve the ME utilization
rate. The core idea is to continue calculations while ‘waiting’ for references
to be completed. Further, with the mechanism of completion signals and command
queues, multiple memory references can be issued simultaneously.
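
The saving from burst references is easy to quantify. The sketch below counts how many memory instructions are needed to move one block with and without bursts. The burst limits are those quoted above; the 4-byte size of a non-burst reference and all names are assumptions made for illustration.

```python
# Burst sizes quoted in the text, in bytes per memory instruction.
BURST_BYTES = {"scratch": 32, "sram": 32, "rdram": 64}
WORD_BYTES = 4  # assumption: a non-burst reference moves one 32-bit word


def references_needed(block_bits: int, bytes_per_reference: int) -> int:
    """Number of memory instructions needed to move one block."""
    if block_bits <= 0 or block_bits % 8 != 0 or bytes_per_reference <= 0:
        raise ValueError("block size must be a positive multiple of 8 bits")
    block_bytes = block_bits // 8
    return -(-block_bytes // bytes_per_reference)  # ceiling division


if __name__ == "__main__":
    # Block sizes are taken from Table 1.
    for name, bits in (("DES", 64), ("AES", 128), ("MD5", 512)):
        print(name,
              references_needed(bits, WORD_BYTES),
              references_needed(bits, BURST_BYTES["sram"]),
              references_needed(bits, BURST_BYTES["rdram"]))
```

With these figures a 512-bit MD5 block needs 16 word-sized references but only one 64-byte RDRAM burst, which illustrates how bursts reduce the memory instruction count.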

Fig. 3 presents the single-thread throughput of the selected algorithms under different
optimization principles. Related internal statistics of the ME are given in Fig. 4. As is
evident from the plot, hash functions have the best performance (MD5: 1219 Mb/s),
followed by stream ciphers and block ciphers. DES achieves the lowest throughput
(32.2 Mb/s after optimization) because it works at the bit level, while the 32-bit IXP2800
has weak support for bit instructions.



Fig. 3. Single thread performance with different optimization principles

The effect of pipeline optimizations seems quite limited. This is because of the short
pipeline architecture of NPs and the low percentage of branch instructions (<2%) in
cryptographic algorithms. The execution statistics also confirm this. Most stream and block
ciphers suffer from a low ME utilization rate (‘active’ in Fig. 4), but generic
optimizations do not take long memory reference latencies into account. On the other hand,
NP-dependent memory optimizations effectively ‘hide’ them and increase the ME utilization
rate significantly, especially for algorithms with a relatively high percentage of
memory operations. For instance, SEAL receives a 438% performance boost after applying
memory optimizations. Even though hash functions have less than 1% memory instructions,
memory optimizations still yield more speedup than generic optimizations.

Fig. 4. Internal statistics of ME

From Fig. 4, we also observe that all algorithms except AES, Blowfish and
SEAL reach a nearly 100% ME utilization rate after memory optimizations. Hence, ME
computing power is still their bottleneck. In contrast, it is not memory bandwidth
but long access latency that limits the throughput of AES, Blowfish and SEAL, because
none of the tested algorithms has its ME ‘stalled’ owing to full target memory
queues or ME command queues.



Scalability test 

An obvious way to improve cryptographic applications on NPs is to use parallelism.
Three types of parallelism can be used: flow-level, block-level and intra-block
parallelism. We select flow-level and block-level parallelism to see how well the
overall throughput scales using multiple threads and MEs of the IXP2800. All block
ciphers are implemented in Cipher Block Chaining (CBC) mode. When encrypting in
CBC mode, block read/write operations can be parallelized; these are handled by a
single thread using I/O parallelization. Thus, we assign one hardware thread to one flow,
and no inter-thread communication is required. Fig. 5 presents the overall throughput of
the selected algorithms with our multi-ME and multi-thread implementation.
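
The one-thread-per-flow mapping follows from the data dependence in CBC: encrypting block i requires the ciphertext of block i-1, so the blocks of one flow are processed in order, whereas different flows share no state. The Python sketch below illustrates this under loud assumptions: the block cipher is a toy XOR stand-in (not secure, and not one of the benchmarked ciphers), and a thread pool stands in for the hardware threads. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

BLOCK = 8  # toy block size in bytes


def toy_encrypt_block(block: bytes, key: bytes) -> bytes:
    """Stand-in for a real block cipher: XOR with the key (NOT secure)."""
    return bytes(b ^ k for b, k in zip(block, key))


def cbc_encrypt(plaintext: bytes, key: bytes, iv: bytes) -> bytes:
    """CBC chaining: block i is mixed with ciphertext block i-1 before encryption."""
    if len(plaintext) % BLOCK or len(key) != BLOCK or len(iv) != BLOCK:
        raise ValueError("plaintext, key and IV must match the block size")
    prev, out = iv, []
    for i in range(0, len(plaintext), BLOCK):
        mixed = bytes(p ^ c for p, c in zip(plaintext[i:i + BLOCK], prev))
        prev = toy_encrypt_block(mixed, key)  # needed by the next iteration
        out.append(prev)
    return b"".join(out)


def encrypt_flows(flows: List[bytes], key: bytes, iv: bytes, workers: int = 8) -> List[bytes]:
    """Flow-level parallelism: one worker per flow, no shared state."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda data: cbc_encrypt(data, key, iv), flows))


if __name__ == "__main__":
    key, iv = bytes(range(1, 9)), b"\x10" * BLOCK
    print([c.hex() for c in encrypt_flows([bytes(16), bytes(range(16))], key, iv)])
```

Because each flow is independent, adding workers scales throughput without any inter-thread communication, which is the property the scalability test relies on.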

Fig. 5. Throughput of selected algorithms with varying number of threads and MEs



5 Summary 



This study selects ten widely used cryptographic algorithms and analyzes their
instruction-level characteristics on the Intel IXP2800 network processor. We suggest several
hardware improvements to current NPs that would help ‘software’ implementations on
data path PEs:

♦ Increase the cache size on PEs to hold large tables and lessen the pressure on shared
memory and buses.

♦ Enlarge the memory queues and command queues to reduce the probability of a PE
being ‘stalled’.

♦ Improve communication among different PEs to help intra-block parallelism.

♦ Adopt a new memory system to shorten the access latency.

We believe that, in combination with these improvements, the proposed implementation
and optimization principles could go a long way toward improving cryptographic
processing performance on network processors.


Abstract. Building intrusion detection models automatically and online is
worth exploring for the timely detection of new attacks. This paper presents a scheme to
automatically construct snort rules online from data captured by honeypots.
Since traffic to honeypots represents abnormal activity, activity patterns
extracted from this data can be used as attack signatures. Packets captured
by honeypots are unwelcome, but it is unnecessary to translate each
of them into a signature or to use the entire payload as the activity pattern. In this paper,
we present an approach based on the system specifications of honeypots, which reflect the
seriousness level of captured packets. Relying on these system specifications,
only critical packets are chosen to generate signatures, and discriminating values
are extracted from the packet payload as activity patterns. After formalizing the
packet structure and the syntax of snort rules, we design an algorithm that generates
snort rules immediately once it meets critical packets.



1 Introduction 

Techniques in an intrusion detection system (IDS) can usually be classified into two
categories: anomaly detection and misuse detection. Anomaly detection views
behaviors that deviate significantly from the normal profile as attacks, e.g., [6] [7]. Misuse
detection systems detect attacks by finding activities that match attack signatures,
which are drawn from known attacks [3] [4] [5]. This approach is effective at
detecting known attacks but has difficulty identifying new attacks due to the lack of
corresponding signatures.

In order to enable misuse detection systems to identify new attacks adaptively, we
explore a method to construct attack signatures from data gathered by honeypots in an
automatic and online way. A honeypot is a security resource whose value lies in being
probed, attacked, or compromised [1]. Generally, honeypots play no role in production
systems. Hence, traffic to and from honeypots is suspicious, providing us with
opportunities to obtain purely intrusive packets. Inspired by this feature of honeypots, we
can extract patterns from these packets and use them as signatures for misuse detection
systems. Here, we choose snort [3] as our target system. Since signatures of
different signature-based IDSs can be mutually translated [8], the present approach
can be extended to other misuse detection systems.

It is usually unnecessary to map each packet to a snort rule. For example, if a scanning
packet or ICMP echo request (ping) is captured, it may not provide distinct information
for identifying abnormal activities, as an intruder may fake its source IP address.
In general, choosing which part of the packet payload to use as a signature is difficult, since a
successful attack usually involves a sequence of packets. Therefore, instead of
constructing a snort rule for each packet, we map only packets chosen according to the
system specifications of honeypots. Moreover, our method extracts discriminating values from
the packet payload as the activity pattern rather than using the entire payload as a signature. In
the course of building signatures, system specifications are important. They are
specified by the honeypot administrator and are made up of system commands, system calls,
system configuration files and even some machine instructions.

In summary, the contributions of this paper are 1) a new usage of honeypots, which
differs from traditional usage; 2) a system-specification-based way to recognize critical
packets and extract discriminating values as activity patterns; and 3) an automatic and
online method to generate attack signatures. In the rest of the paper, § 2 describes our
requirements for the new usage of honeypots, § 3 discusses our method and § 4
concludes the paper.



2 Requirements to Honeypots 

There are many types of honeypots. According to interaction level, they are classified
into three types [1]: low-interaction, medium-interaction and high-interaction honeypots.
Low-interaction honeypots emulate some services; medium-interaction ones also
emulate services (but they can respond to attackers’ requests to some extent); while
high-interaction ones are real operating systems and services. This paper is concerned with
high-interaction honeypots, as we need to collect real data coming from intruders instead of
simply detecting unauthorized scans or connection attempts.

In a honeynet, production traffic goes only to the production network, while intrusion
traffic goes to all hosts, since intruders try to attack as many systems as possible.
When intruders use new attack methods, conventional misuse detection systems in the
production network may have difficulty detecting them. However, when they are equipped
with honeypots that can update the signature bases of misuse detection systems,
the misuse detection systems can adapt to new attacks quickly. This paper focuses on
how honeypots generate snort rules. The issue of how rules are exchanged between
honeypots and snort is not covered in this paper.

The main requirements for honeypots are data control, data capture and data collection
[1]. For the honeypots in our scheme, we have two further requirements:

1) Correspondence between honeypots and servers in the production network. This
means each honeypot corresponds to a server or several identical servers, and vice
versa.

2) Security levels of corresponding honeypots and servers. A honeypot should be
as secure as its corresponding servers.



3 Generating Signatures Online

By generating signatures online, we mean generating snort rules online as soon as
honeypots capture suspicious packets, without the intervention of administrators. To this
end, we consider two issues. One is the specific procedure to map a given packet to a
snort rule (§ 3.1). The other is how to choose which of the packets captured
by honeypots to map, and how to extract the activity patterns from their payloads (§ 3.2).



3.1 Mapping a Packet to a Snort Rule

To map a given packet to a snort rule, we need to describe the packet structure and the
syntax of snort rules formally. In this subsection, we introduce the mapping procedure
from a given IP packet to a snort rule.

Consider formalizing the packet structure first. A packet is in essence a stream of raw bits.
How to interpret this stream is determined by its structure, which is usually specified
in standards. Snort currently can analyze four types of protocols (IP, TCP, UDP and
ICMP). Below, we take IP as an example to show how to formalize the packet structure.

For our purpose, an IP packet structure is described as: <srcIP, dstIP, ttl, tos,
fragbits, ipoption, protocol, payload>, where
—srcIP is the source IP address, and dstIP the destination IP address;
—ttl is the time to live (ttl) field, and tos is the type of service field;
—fragbits is the fragment flag field, including three bits that can be checked, namely,
the reserved (R) bit, the more fragments (M) bit and the don't fragment (D) bit;
—ipoption is the options field. There are eight option types, including strict source
routing (ssr), loose source routing (lsr), IP security option (seq), time stamp (ts),
record route (rr), end of list (eol), no option (nop), and stream identifier (satid);
—protocol is the type of transport packet being carried;
—payload is the data encapsulated in the IP packet.

In the above structure, we omit some fields of less interest for our research,
e.g., the checksum field. In addition, let count(ipoption) denote the number of options
in ipoption. Then, ipoption[i] (0 ≤ i < count(ipoption)) denotes each option’s type. If p is an
IP packet, we use p.srcIP to denote p’s source address, p.dstIP to denote p’s destination
address, and so on.
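
The tuple above maps naturally onto a record type. The following Python sketch is one possible encoding and not the authors' implementation: the field names follow the text, while the types, the defaults and the bit encoding of fragbits are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IPPacket:
    """The tuple <srcIP, dstIP, ttl, tos, fragbits, ipoption, protocol, payload>."""
    srcIP: str
    dstIP: str
    ttl: int
    tos: int
    fragbits: int                      # assumed encoding: R=0b100, D=0b010, M=0b001
    ipoption: List[str] = field(default_factory=list)
    protocol: str = "IP"
    payload: bytes = b""


def count(ipoption: List[str]) -> int:
    """count(ipoption) from the text: the number of options present."""
    return len(ipoption)


# Hypothetical packet used only to show the p.field notation.
p = IPPacket("10.0.0.5", "192.168.1.10", ttl=64, tos=0, fragbits=0b010,
             ipoption=["rr", "ts"], protocol="TCP", payload=b"GET / HTTP/1.0")
option_types = [p.ipoption[i] for i in range(count(p.ipoption))]  # 0 <= i < count
```

Attribute access such as `p.srcIP` or `p.ipoption[i]` then mirrors the notation used in the rest of this section.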

The syntax of snort rules given previously is the target of signature construction. If a
non-terminal symbol is defined only with a terminal symbol, the non-terminal symbol will
be replaced by the corresponding terminal symbol when generating a rule. Otherwise,
we have to determine which terminal symbol should be chosen according to the packet
content.

Let us now discuss the mapping procedure. The major work of mapping is to determine the
values of the BNF non-terminal symbols in the snort rule syntax. Suppose p is an IP packet.
Algorithm 1 produces a snort rule from p. In addition, it is desirable to include the option
“msg” “:” “Signature from honeypots” “;” in each generated snort rule so as to
facilitate management of rule bases, but we omit this option in Algorithm 1 for simplicity.

Algorithm 1. Mapping procedure between a given packet and a snort rule
INPUT : an IP packet p
OUTPUT: a snort rule

1. let ip_patterns = “ttl” “:” “p.ttl” “;” “tos” “:” “p.tos” “;” ;
2. let ip_patterns += “fragbits” “:” + substring(“RDM”, p.fragbits) + “;” ;
3. for i = 0 to count(p.ipoption)-1, do
       let ip_patterns += “ipopts” “:” “p.ipoption[i]” “;” ;
4. if p.protocol ∉ {TCP, UDP, ICMP}, then
       <protocol> ::= “ip”;
       <rport> ::= “any”;
       <options> ::= ip_patterns + “content” “:” “p.payload” “;” ;
5. if p.protocol = UDP, then
       <protocol> ::= “udp”;
       <rport> ::= “p.payload.destPort”;
       <options> ::= ip_patterns + “content” “:” “p.payload.payload” “;” ;
6. if p.protocol = TCP, then
       <protocol> ::= “tcp”;
       <rport> ::= “p.payload.destPort”;
       let tcp_flags = substring(“12UAPRSF”, p.flags);
       if tcp_flags = “”, then tcp_flags = “0”;
       let tcp_patterns = “flags” “:” + tcp_flags + “;” ;
       <options> ::= ip_patterns + tcp_patterns + “content” “:” “p.payload.payload” “;” ;
7. if p.protocol = ICMP, then
       <protocol> ::= “icmp”;
       <rport> ::= “any”;
       let icmp_patterns = “itype” “:” “p.payload.type” “;” ;
       let icmp_patterns += “icode” “:” “p.payload.code” “;” ;
       let icmp_patterns += “icmp_id” “:” “p.echo_id” “;” ;
       let icmp_patterns += “icmp_seq” “:” “p.echo_seq” “;” ;
       <options> ::= ip_patterns + icmp_patterns + “content” “:” “p.payload.payload” “;” ;
8. return;

In the above algorithm, “+” is used to concatenate two terminal symbols in BNF;
“” represents an empty string; substring(str, indicator) extracts a substring from str
according to the indicator. For example, for substring(“12UAPRSF”, p.flags), if
p.flags = 0x55, then the substring “2ARF” is generated. It should be noted that Algorithm
1 simply maps p’s entire payload, which will be improved in the following section.
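
The substring helper and the TCP branch of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading rather than the authors' code: it assumes that the leftmost letter of the string corresponds to the most significant bit of the indicator (which reproduces the 0x55 example above), that the rule header has the form "alert tcp any any -> any port", and that the payload is printable text without quotes or semicolons. The function names are hypothetical.

```python
from typing import List


def substring(letters: str, indicator: int) -> str:
    """Keep letters[k] if the corresponding bit of indicator is set.

    The leftmost letter maps to the most significant bit.
    """
    n = len(letters)
    return "".join(ch for k, ch in enumerate(letters)
                   if (indicator >> (n - 1 - k)) & 1)


def tcp_rule(ttl: int, tos: int, fragbits: int, ipoptions: List[str],
             dest_port: int, flags: int, payload: str) -> str:
    """Steps 1-3 and 6 of Algorithm 1 for a TCP packet."""
    ip_patterns = f"ttl:{ttl}; tos:{tos}; "
    frag = substring("RDM", fragbits)
    if frag:  # deviation from Algorithm 1: skip the option when no bit is set
        ip_patterns += f"fragbits:{frag}; "
    for opt in ipoptions:
        ip_patterns += f"ipopts:{opt}; "
    tcp_flags = substring("12UAPRSF", flags) or "0"
    tcp_patterns = f"flags:{tcp_flags}; "
    options = ip_patterns + tcp_patterns + f'content:"{payload}";'
    # Assumed header layout: any source, any destination host, known port.
    return f"alert tcp any any -> any {dest_port} ({options})"


if __name__ == "__main__":
    print(substring("12UAPRSF", 0x55))
    print(tcp_rule(64, 0, 0b010, ["rr"], 80, 0x18, "GET /x"))
```

The UDP, ICMP and plain IP branches differ only in the header protocol and in which pattern strings are concatenated.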

3.2 System Specifications, Critical Packets, Discriminating Values, and
Building Signature

Using Algorithm 1, one can map the fields of a given packet to the corresponding
constructs of a snort rule. Nevertheless, this is still not enough to build accurate and
efficient attack signatures. As stated previously, it may be unnecessary to map every
captured packet to a snort rule, and extracting an activity pattern from a packet payload is
also a difficult task. In this subsection, we address both problems by turning to the
knowledge in the system specifications of honeypots.

Different definitions of system specifications can be used for different purposes.
For example, to describe the hardware of a computer, one can use CPU
frequency, memory and hard disk size, and network speed as system specifications. Here, we
are concerned with the system specifications of honeypots that characterize the seriousness
level of captured packets with regard to network security.

Generally, intruders exploit vulnerabilities of programs to obtain the privileges
necessary to carry out attacks. In the course of an attack, in particular an attack on hosts
such as the R2L attacks in [9], an intruder uses system calls (even some specific machine
instructions) to change the execution path, uses system commands to change the
system state, or modifies system configuration files to leave back doors. For example,
WinExec is used to execute shell code in a buffer overflow attack, and worm or
Trojan programs are copied in a malicious code attack. It can be concluded that, among packets
captured by honeypots, those containing system calls, commands or configuration
files represent more serious intrusive activities than those without such information.

Therefore, the system specifications of a honeypot are defined as a set of system calls,
system commands, system configuration files, or even machine instructions. C is used
to denote the system specifications. For example, cmd.exe, win.ini, WinExec, dir, cp are
all elements of C in Microsoft Windows; fork, passwd, ln belong to C on a Unix or
Linux platform; the machine instruction “Jump” is also an element of C.

For each honeypot, its administrator should specify its system specifications
explicitly. If C is empty, nothing is considered to be serious, and
no rules will be generated. On the other hand, if C contains all possible objects on
the honeypot, then almost every captured packet will result in a snort rule. For example,
if the file index.htm of the web server on a honeypot is included in C, then an unwelcome
request for this file will generate a snort rule, and obviously this rule will cause false
positives in the production network. Fortunately, an administrator often knows his system
very well. In other words, he knows what the system specifications are, and which
specifications are more important. Therefore, he can give a reasonable set of system
specifications C for a honeypot.

A list of probes captured by a honeypot in a 30-day period is given in [12], where
some “ordinary” packets (such as ICMP echo request packets and DNS version
query packets) are included. Suppose p is such a normal packet and r is the snort rule
produced from it by Algorithm 1. Then r will match packets in normal production traffic,
which will cause false alarms. Therefore, we do not map “ordinary” packets captured by
honeypots to snort rules; instead, only critical packets are chosen. Below we give
the definition of critical packets.

Definition 1. Suppose p_1 → p_2 → p_3 → ... → p_n is a series of packets captured by
honeypots for an attack. p_i (1 ≤ i ≤ n) is a critical packet for the attack if the following
conditions hold:

a) The payload of p_i contains c, c ∈ C;

b) For ∀p_j (1 ≤ j < i), p_j is not critical; these p_j are ordinary packets.

In order to carry out attacks, an intruder usually uses ordinary packets to
gather information about target hosts. Critical packets help an intruder to get the necessary
privileges or install backdoor programs. The packets following critical packets
represent the intruder’s activities on honeypots after getting some privileges, such as
creating directories, modifying files, etc. These packets also contain system calls. However,
we don’t translate them into snort rules because they rely on the critical packets.
Since a snort rule is a per-packet signature, we define only one critical packet in a
successful attack. In fact, if the misuse detection system uses state-transition signatures as
described in [4], in which every state represents an occurrence of events, we can choose
many critical packets to build such signatures.
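
Definition 1 amounts to finding the first packet of an attack sequence whose payload contains an element of C. A minimal Python sketch follows; it assumes payloads are byte strings and that containment means an exact, case-sensitive substring match, neither of which is specified in the text. The function name and the sample packets are hypothetical.

```python
from typing import Iterable, Optional, Sequence


def find_critical_packet(payloads: Sequence[bytes], C: Iterable[bytes]) -> Optional[int]:
    """Index of the first payload containing some c in C, or None if there is none."""
    specs = list(C)
    for i, payload in enumerate(payloads):
        if any(c in payload for c in specs):   # condition a)
            return i                            # earlier packets were ordinary: condition b)
    return None


if __name__ == "__main__":
    attack = [
        b"\x08\x00ping",                                   # ordinary: probe
        b"GET /index.htm HTTP/1.0",                        # ordinary: reconnaissance
        b"GET /scripts/cmd.exe?/c+dir HTTP/1.0",           # critical
        b"dir c:\\",                                       # follows the critical packet
    ]
    print(find_critical_packet(attack, [b"cmd.exe", b"dir"]))
```

The last packet also contains an element of C, but it is not selected because only the first match in the sequence is critical.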

For attacks on the network, such as teardrop attacks and SYN flood attacks, packets may
contain no system calls or system commands. The goals of these attacks are usually
to crash the target systems or make them deny service. Constructing signatures based
on critical packets may not cover this kind of attack. However, attacks on the network
involve only packet headers, which have a stricter structure. Thus, they have far fewer
variations and new attacks than those on hosts.

In Algorithm 1, the entire packet payload is used as the argument value of the content
option. Hence, the content option in the resulting rule contains more than enough bits to
characterize the activity pattern. This results in two drawbacks: 1) matching snort rules
against network traffic will be inefficient because there are more bits to deal with; and
2) false negatives are possible, because finding a longer bit sequence exactly in
network traffic is more difficult, and a change of some redundant bits will produce
variants of attacks that fail to match. To avoid these drawbacks, we should
identify a representative subsequence of bits in the packet payload as the activity pattern,
called the discriminating value of the packet. For a critical packet, only its discriminating
value is used as the argument value of content options. A formal definition of the
discriminating value is given below.

Definition 2. Suppose p is a critical packet captured by honeypots. The discriminating
value of p is a triple (serv, op, c) or a pair (serv, c), where

a) serv is the service type;

b) op is the type of service operation;

c) c ∈ C is contained in p’s payload.

If the service is based on TCP or UDP, then discriminating values take the triple
form, otherwise the pair form. For example, the discriminating value for an HTTP
packet can be (HTTP, GET, cmd.exe); the discriminating value for an attack based on
a buffer overflow in IP protocol software can be (IP, WinExec). In the first case, field
serv can be characterized uniquely by the destination port, such as 80 for HTTP and
23 for TELNET; field op can be determined by interpreting the packet payload according
to the packet format of service serv. In the second case, field serv can be determined
by the protocol name, i.e. IP or ICMP. In both cases, field c can be obtained by
looking up each element of C in the payload; one of the found elements is used as c.
Because p.payload may contain several elements of C, we use a heuristic method to
choose the most distinct one. Suppose n elements c_i (1 ≤ i ≤ n) have been found and
pos(c_i) is the position of c_i in p.payload. Then the c_i with minimum pos(c_i) is chosen as c.
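
The construction of a discriminating value can be sketched as follows. This is a hedged illustration: the port table contains only the two services named above, and the rule that the first whitespace-delimited token of the payload names the operation is our simplifying assumption for text protocols such as HTTP, since the text leaves the interpretation of op open. All function names are hypothetical.

```python
from typing import Iterable, Optional, Tuple

PORT_TO_SERVICE = {80: "HTTP", 23: "TELNET"}  # the two ports named in the text


def choose_c(payload: bytes, C: Iterable[bytes]) -> Optional[bytes]:
    """Element of C that occurs earliest in the payload (minimum pos), or None."""
    found = [(payload.find(c), c) for c in C if c in payload]
    return min(found)[1] if found else None


def discriminating_value(protocol: str, dest_port: Optional[int],
                         payload: bytes, C: Iterable[bytes]) -> Optional[Tuple[str, ...]]:
    """Return (serv, op, c) for TCP/UDP services, (serv, c) otherwise."""
    c = choose_c(payload, C)
    if c is None:
        return None
    c_text = c.decode("ascii", "replace")
    if protocol in ("TCP", "UDP"):
        serv = PORT_TO_SERVICE.get(dest_port, str(dest_port))
        tokens = payload.split(None, 1)
        # Assumption: in a text protocol the first token names the operation.
        op = tokens[0].decode("ascii", "replace") if tokens else ""
        return (serv, op, c_text)
    return (protocol, c_text)


if __name__ == "__main__":
    print(discriminating_value("TCP", 80, b"GET /scripts/cmd.exe?/c+dir HTTP/1.0",
                               [b"dir", b"cmd.exe"]))
    print(discriminating_value("IP", None, b"\x90\x90WinExec\x00", [b"WinExec"]))
```

Both "dir" and "cmd.exe" occur in the first payload; the earlier occurrence, cmd.exe, is chosen, which gives the (HTTP, GET, cmd.exe) example from the text.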

Combining critical packets with discriminating values, Algorithm 1 can be improved
into Algorithm 2. For simplicity, only the modifications are listed.



Algorithm 2. Building snort rules with critical packets and discriminating values
INPUT : critical packet p from a honeypot, system specifications C
OUTPUT: a snort rule or null

1. S_p = ∅;
2. for ∀c ∈ C, if found(c, p.payload.payload), then
       S_p = S_p ∪ {(c, pos(c))};
3. if S_p = ∅, then
       return null;
4. let c = c', where (c', pos(c')) ∈ S_p and pos(c') = min({pos(c'') | (c'', pos(c'')) ∈ S_p});

5. if p.protocol ∉ {TCP, UDP, ICMP}, then

       <options> ::= ip_patterns + “content” “:” “c” “;” ;
6. if p.protocol = UDP, then

       let op be the operation type of the service in p.payload.payload;
       <options> ::= ip_patterns + “content” “:” “op” “;” + “content” “:” “c” “;” ;
7. if p.protocol = TCP, then

       let op be the operation type of the service in p.payload.payload;
       <options> ::= ip_patterns + “content” “:” “op” “;” + “content” “:” “c” “;” ;
8. if p.protocol = ICMP, then

       <options> ::= ip_patterns + icmp_patterns + “content” “:” “c” “;” ;
9. return;

Algorithm 2 differs from Algorithm 1 in two ways. First, it checks whether p is a
critical packet: if not, no snort rule is generated for p; if so, the discriminating
value c is calculated. Second, it uses c and op as the argument values of the content
option. As a result, Algorithm 2 can produce more compact and flexible snort rules
that identify variants of attacks. We omit the details of how the operation type op
is interpreted.
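
The selection of the discriminating value and the assembly of the content options can be sketched in Python as follows. This is an illustration only, not the authors' implementation: the function name `build_content_options` is hypothetical, the strings `ip_patterns` and `icmp_patterns` are placeholders for the option groups produced by Algorithm 1 (not shown here), and the operation type `op` is assumed to have been extracted from the payload beforehand.

```python
def build_content_options(protocol, payload, spec, op=None):
    """Return the option parts of a rule, or None if the packet is not critical.

    protocol: "TCP", "UDP", "ICMP" or another IP protocol name
    payload:  application payload as bytes
    spec:     iterable of byte strings (the set C of discriminating values)
    op:       operation type already extracted from the payload (TCP/UDP only)
    """
    # Steps 1-2: collect every element of C found in the payload with its position.
    found = [(payload.find(c), c) for c in spec if c in payload]
    # Step 3: no element of C is present, so the packet is not critical.
    if not found:
        return None
    # Step 4: keep the element with the smallest position.
    _, c = min(found)
    # Steps 5-8: assemble the options according to the protocol.
    parts = ["ip_patterns"]                      # placeholder from Algorithm 1
    if protocol == "ICMP":
        parts.append("icmp_patterns")            # placeholder from Algorithm 1
    if protocol in ("TCP", "UDP") and op is not None:
        parts.append('content:"%s"' % op.decode("latin-1"))
    parts.append('content:"%s"' % c.decode("latin-1"))
    return parts


spec = [b"/bin/sh", b"cmd.exe"]
packet = b"GET /scripts/cmd.exe?/c+dir+/bin/sh HTTP/1.0"
options = build_content_options("TCP", packet, spec, op=b"GET")
```

Both signatures occur in the example payload; `cmd.exe` is chosen because it appears first, which is the minimum-position heuristic of step 4.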

The above explanations show that the present method does not simply translate
each captured packet into a snort rule. Owing to limited space, cases illustrating
the application of the method are not given.
4 Conclusions and Acknowledgements

The use of honeypots for building snort rules online from the data they capture
has been discussed. We have analyzed the requirements that honeypots must meet
in order to generate useful signatures for detecting attacks in a production
network. The system specifications used to recognize critical packets and extract
discriminating values as activity patterns have been explained. Algorithms for
automatic, online generation of attack signatures have been derived. This research
is under a grant for the project Pervasive Virtual Community in Cyberspace
(R-252-000-079-112), Singapore. The paper is in part sponsored by SRF for ROCS,
State Education Ministry, PRC.
Abstract. To improve the efficiency and usability of adaptive anomaly
detection systems, we propose a new framework based on the Support Vector
Data Description (SVDD) method. The framework includes two main
techniques: online change detection and unsupervised anomaly detection.
The first automatically obtains model training data by measuring change
in the data flow, distinguishing change caused by intensive attacks from
change in normal behavior, and filtering out most intensive attacks. The
second retrains the model periodically and classifies incoming data.
Experiments with the KDD’99 network data show that these techniques
handle intensive attacks effectively and adapt to concept drift while
still detecting attacks. As a result, the false positive rate is
reduced from 13.43% to 4.45%.



1 Introduction 

Intrusion detection is a necessary complement to traditional intrusion prevention
techniques for guaranteeing network security. There are two general approaches to
intrusion detection: misuse detection and anomaly detection [1]. Compared with
misuse detection, anomaly detection has the advantage that it can detect new
types of attacks. However, it suffers from a high false alarm rate, especially
when normal behavior changes over time. In practice, user, network and system
activities do not stay invariant as the environment changes. This phenomenon is
called concept drift [2]. To adapt accurately to concept drift while still
recognizing anomalous activities, adaptive anomaly detection systems have to
retrain and update their models frequently with online or newly collected
data [3].

Unsupervised learning algorithms, which train models with unlabelled data,
are promising for adaptive anomaly detection and have been studied by
researchers in recent years [3, 4, 5, 6]. In [3], a general adaptive model
generation system for anomaly detection is presented, which uses a
probability-based algorithm to build models over noisy data periodically.
SmartSifter, an online unsupervised learning algorithm for anomaly detection
based on a probabilistic model, adjusts the model after each input datum [4].
More recently, several different unsupervised learning algorithms have been
applied to anomaly detection, including a cluster-based algorithm, a k-nearest
neighbor based algorithm, the LOF approach, and a one-class SVM algorithm [7].
The work most similar to our SVDD-based unsupervised anomaly detection is
one-class SVM based anomaly detection. These algorithms, as well as the earlier
probabilistic algorithms [3, 4], rely on an important assumption about the
attack ratio: attacks can be treated as outliers because they are rare and
qualitatively different from normal data. Under this assumption, the algorithms
can use real time data directly to update their models constantly or retrain
them periodically. However, the assumption, i.e. that normal data greatly
outnumber attacks, limits the application of these algorithms in practice,
because the number of large-scale DoS attacks and probing attacks has been
increasing alarmingly over the past few years. The assumption does not hold
when a burst of intensive attacks causes a large number of anomalous instances
in a short time.

In this paper, we present a new framework for adaptive anomaly detection,
which extends the traditional unsupervised method and overcomes the limitations
of the attack ratio assumption. In the framework, we introduce the SVDD
algorithm to anomaly detection. We also present an SVDD-based online change
detection algorithm that distinguishes changes caused by intensive attacks from
concept drift. With the aid of the change detection algorithm, intensive
attacks are filtered out first, and the model can then be retrained safely.

The rest of this paper is organized as follows. In Section 2, we describe the
SVDD algorithm and introduce the change point detection algorithm; based on
these algorithms, we then present the SVDD-based adaptive anomaly detection
framework. In Section 3, we discuss our experiments with the KDD’99 data. We
summarize our conclusions in Section 4.



2 SVDD-based Anomaly Detection 

SVDD [8] is an unsupervised support vector machine algorithm for outlier
detection. The goal of SVDD is to distinguish one class of data, called target
data, from the rest of the feature space. To do this, SVDD learns an optimal
hypersphere around the target data after mapping the whole dataset to a high
dimensional feature space. The hypersphere serves as a descriptive model of the
target data and is used to classify data as target data or non-target data
(also called outliers). For SVDD-based anomaly detection we take normal data as
the target class and all kinds of known and unknown attacks as outliers.



2.1 SVDD 

Let $\{x_i\} \subseteq \chi$ be a training dataset of $N$ data points, with
$\chi \subseteq \mathbb{R}^d$. Using a nonlinear transformation $\Phi$ from
$\chi$ to some high dimensional feature space, we search for the optimal
enclosing hypersphere that is as small as possible while at the same time
including most of the training data. This can be formulated as the following
optimization problem:

$$\min_{R,\,\xi,\,a} \; R^2 + \frac{1}{\nu N} \sum_i \xi_i \qquad (1)$$

$$\text{subject to } \|\Phi(x_i) - a\|^2 \le R^2 + \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N,$$

where $a$ is the center of the hypersphere and $R$ is its radius. Parameter $\nu$
controls the tradeoff between the radius of the hypersphere and the number of
points that it contains. If $R$ and $a$ solve this problem, the decision function
$f(x) = \mathrm{sgn}(R^2 - \|\Phi(x) - a\|^2)$ is determined by the location of
$x$ in the feature space. To solve this problem we introduce the Lagrangian:

$$L = R^2 - \sum_i \left(R^2 + \xi_i - \|\Phi(x_i) - a\|^2\right)\alpha_i - \sum_i \xi_i \beta_i + \frac{1}{\nu N} \sum_i \xi_i , \qquad (2)$$

where $\alpha_i \ge 0$ and $\beta_i \ge 0$ are Lagrange multipliers. Setting to
zero the derivatives of $L$ with respect to $R$, $a$ and $\xi_i$ leads to

$$a = \sum_i \alpha_i \Phi(x_i), \qquad \sum_i \alpha_i = 1 . \qquad (3)$$

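We then turn the Lagrangian into the Wolfe dual form with kernel function $K$:

$$\min_{\alpha} \; \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) - \sum_i \alpha_i K(x_i, x_i) \qquad (4)$$

$$\text{subject to } \sum_i \alpha_i = 1, \quad 0 \le \alpha_i \le \frac{1}{\nu N}, \quad i = 1, \ldots, N .$$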



Throughout this paper we use the Gaussian kernel
$K(x_i, x_j) = \exp(-q \|x_i - x_j\|^2)$ with width parameter $q$. The optimal
$\alpha_i$ are obtained by solving the dual problem. The few points with
$0 < \alpha_i < 1/(\nu N)$ lie exactly on the surface of the hypersphere and are
called support vectors. The first equation of (3) means that $a$ can be expressed
as a linear combination of the $\Phi(x_i)$, and $R$ can then be computed from any
support vector $x_k$:

$$R^2 = \|\Phi(x_k) - a\|^2 = K(x_k, x_k) - 2 \sum_i \alpha_i K(x_i, x_k) + \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) . \qquad (5)$$
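
To make the use of the kernel expansion concrete, the following Python sketch evaluates the squared distance to the center, the squared radius and the decision function from a given dual solution. It is a minimal illustration under stated assumptions: the coefficients `alpha` are assumed to come from some solver of the dual problem (not shown), the function names are hypothetical, and the toy solution below is the symmetric two-point case, for which both coefficients equal 0.5.

```python
import math


def gaussian_kernel(x, y, q):
    """K(x, y) = exp(-q * ||x - y||^2)."""
    return math.exp(-q * sum((a - b) ** 2 for a, b in zip(x, y)))


def dist2_to_center(x, X, alpha, q):
    """||Phi(x) - a||^2 expanded with the kernel."""
    k_xx = gaussian_kernel(x, x, q)
    cross = sum(a * gaussian_kernel(xi, x, q) for a, xi in zip(alpha, X))
    quad = sum(ai * aj * gaussian_kernel(xi, xj, q)
               for ai, xi in zip(alpha, X)
               for aj, xj in zip(alpha, X))
    return k_xx - 2.0 * cross + quad


def radius2(X, alpha, q, nu, tol=1e-9):
    """R^2 computed from any support vector with 0 < alpha_k < 1/(nu*N)."""
    upper = 1.0 / (nu * len(X))
    for a, xk in zip(alpha, X):
        if tol < a < upper - tol:
            return dist2_to_center(xk, X, alpha, q)
    raise ValueError("no support vector strictly inside the box constraints")


def decide(x, X, alpha, q, r2):
    """+1 for target (inside or on the hypersphere), -1 for outlier."""
    return 1 if dist2_to_center(x, X, alpha, q) <= r2 else -1


# Toy dual solution: two training points, equal weights.
X = [(0.0,), (1.0,)]
alpha = [0.5, 0.5]
r2 = radius2(X, alpha, q=1.0, nu=0.5)
inside = decide((0.5,), X, alpha, 1.0, r2)
outside = decide((10.0,), X, alpha, 1.0, r2)
```

Everything is computed from kernel values alone, so the feature map never has to be evaluated explicitly.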



2.2 Change Detection Algorithm 

The main idea of our change detection algorithm comes from change point
detection theory. The objective of change point detection is to determine
whether the observed time series is statistically homogeneous and, if not, to
find the point in time when the change happens [9]. In our application, real
time data from sensors are processed into multi-dimensional time series.
Compared with the traditional Cumulative Sum (CUSUM) change detection algorithm,
our SVDD-based algorithm can easily be applied to multi-dimensional series.

The idea of SVDD-based change detection is simple. SVDD always tries to find an
optimal hypersphere for the target class, which makes up the great majority of
the training data. The region of the hypersphere is thus representative of the
probability density function that generates the target class. Hence, comparing
the geometry and location of two hyperspheres has the same effect as comparing
the training data on which the hyperspheres are built.

Fig. 1 illustrates the change detection algorithm. Two adjoining sliding
windows of the same size $m$ are placed on the series to produce adjoining
subsets of the data flow. The two windows move forward simultaneously with a
fixed increment. At time $t$, the subsets $W_1 = \{x_{t-m}, \ldots, x_{t-1}\}$
and $W_2 = \{x_t, \ldots, x_{t+m-1}\}$ are obtained from the two windows. If we
use them as training data to build SVDD models independently, we get hypersphere
$S_1$, defined by center $a_1$ and radius $R_1$, for $W_1$ and hypersphere $S_2$,
defined by center $a_2$ and radius $R_2$, for $W_2$. An unexpected change at time
$t$, which means a different distribution of data after $t$, may result in a
different location and geometry for $S_1$ and $S_2$. We use a change detection
index $I(t)$ to reflect the dissimilarity between $S_1$ and $S_2$:

$$I(t) = \frac{\|a_1 - a_2\|}{R_1 + R_2} . \qquad (6)$$

According to Section 2.1, $a_1$, $a_2$, and $\|a_1 - a_2\|$ can be computed as

$$a_1 = \sum_i \alpha_{1i} \Phi(x_{1i}), \qquad a_2 = \sum_j \alpha_{2j} \Phi(x_{2j}),$$

$$\|a_1 - a_2\|^2 = \sum_{i,j} \alpha_{1i}\alpha_{1j} K(x_{1i}, x_{1j}) + \sum_{i,j} \alpha_{2i}\alpha_{2j} K(x_{2i}, x_{2j}) - 2\sum_{i,j} \alpha_{1i}\alpha_{2j} K(x_{1i}, x_{2j}),$$

where $x_{1i}$ and $x_{2j}$ denote the points of $W_1$ and $W_2$, and
$\alpha_{1i}$ and $\alpha_{2j}$ their dual coefficients. The radii of the
hyperspheres can also be computed from their support vectors according to (5).
Thus, although $I(t)$ is defined in the feature space, it can be computed in the
input space using the kernel function.
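
The computation of the change index from two trained models can be sketched in Python as below. This is a hedged illustration, not the authors' code: the function names are hypothetical, and each model is assumed to be given as its window points, dual coefficients and radius, all obtained beforehand from an SVDD solver that is not shown.

```python
import math


def gaussian_kernel(x, y, q):
    """K(x, y) = exp(-q * ||x - y||^2)."""
    return math.exp(-q * sum((a - b) ** 2 for a, b in zip(x, y)))


def cross_term(X1, alpha1, X2, alpha2, q):
    """sum_{i,j} alpha1_i * alpha2_j * K(x1_i, x2_j)."""
    return sum(a1 * a2 * gaussian_kernel(x1, x2, q)
               for a1, x1 in zip(alpha1, X1)
               for a2, x2 in zip(alpha2, X2))


def change_index(X1, alpha1, R1, X2, alpha2, R2, q):
    """I(t) = ||a1 - a2|| / (R1 + R2), evaluated with kernel values only."""
    d2 = (cross_term(X1, alpha1, X1, alpha1, q)
          + cross_term(X2, alpha2, X2, alpha2, q)
          - 2.0 * cross_term(X1, alpha1, X2, alpha2, q))
    # Guard against a tiny negative value caused by rounding.
    return math.sqrt(max(d2, 0.0)) / (R1 + R2)


# Toy models: two-point windows with equal weights (symmetric dual solution).
alpha = [0.5, 0.5]
R = math.sqrt(0.5 * (1.0 - math.exp(-1.0)))   # radius of each toy hypersphere
W1 = [(0.0,), (1.0,)]
W2_same = [(0.0,), (1.0,)]
W2_far = [(100.0,), (101.0,)]
index_same = change_index(W1, alpha, R, W2_same, alpha, R, q=1.0)
index_far = change_index(W1, alpha, R, W2_far, alpha, R, q=1.0)
```

Identical windows give an index of zero, while windows drawn from clearly separated regions give an index above one in this toy setting.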









Fig. 1. Data series and sliding windows at time t. The right arrow indicates the
direction of data generation.



As input data are continually generated, the two windows move simultaneously
with a predefined fixed increment w, and I(t) is computed at every step. We
thus obtain an index curve of I(t), and abrupt changes are easily detected
whenever the index I(t) peaks or exceeds a threshold A.

There are two parameters, w and m, and a threshold A to be considered. The
window size m is selected based on several factors. It should not be too small;
otherwise it cannot reflect the data distribution, and I(t) will be unsteady
even for a purely normal data flow. On the other hand, too large an m is
infeasible and unnecessary because it increases the computational cost. We are
not able to find a universal value of m for every application, but we can find
a proper value for our application by testing different values of m on a normal
data flow until the change index values are steady, with little variance. The
moving increment w can range from 1 to m, depending on the acceptable detection
delay. The index I(t) measures the extent of change. In our application, we
assume that a sudden burst of large-scale intensive attacks causes abrupt
changes in the data flow, while concept drift causes mild and gradual changes.
The threshold A is used to detect abrupt change: if a value of I(t) in the
index curve goes above A, it indicates an ongoing intensive attack.
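
The window mechanics described above can be sketched as a short Python loop. The sketch is deliberately generic: `detect_changes` takes the index computation as a callable, and `mean_shift_index` is only a stand-in used to make the example runnable. It is not the SVDD-based index of equation (6), and both names are hypothetical.

```python
def detect_changes(series, m, w, threshold, change_index):
    """Slide two adjoining windows of size m in steps of w over the series.

    Returns the times t at which change_index(W1, W2) exceeds the threshold,
    where W1 = series[t-m:t] and W2 = series[t:t+m].
    """
    alarms = []
    for t in range(m, len(series) - m + 1, w):
        w1 = series[t - m:t]
        w2 = series[t:t + m]
        if change_index(w1, w2) > threshold:
            alarms.append(t)
    return alarms


def mean_shift_index(w1, w2):
    """Stand-in index for the demo only: absolute difference of window means."""
    return abs(sum(w2) / len(w2) - sum(w1) / len(w1))


# A one-dimensional flow whose level jumps at t = 10.
series = [0.0] * 10 + [10.0] * 10
alarms = detect_changes(series, m=4, w=2, threshold=6.0,
                        change_index=mean_shift_index)
```

A smaller increment w evaluates the index at more positions and so shortens the alarm delay, at the price of more model training.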

2.3 Adaptive Anomaly Detection Framework 

Based on the SVDD algorithm and the change detection algorithm, we design an
adaptive anomaly detection framework, which consists of four main components:
preprocessor, change detector, model generator, and anomaly detector. The
preprocessor transforms the raw network packets from sensors into formatted
data and sends these data to the anomaly detector and the change detector. The
anomaly detector uses an SVDD model to classify data as normal or intrusive and
raises an alarm for an ongoing intrusion. The change detector uses the change
detection algorithm to detect intensive attacks and prepares training data for
the model generator. The training data are stored in a database and sent to the
model generator when the model update condition is triggered. The model
generator learns a new model from the new training data and feeds the model to
the anomaly detector periodically.




Fig. 2. Adaptive Anomaly Detection Framework 



3 Experiment 

We conducted experiments on the KDD’99 dataset [10], which was prepared for
network intrusion detection. In the dataset, the network traffic data are
connection-based. Each connection instance, described by 7 symbolic attributes
and 34 continuous attributes, corresponds to a TCP/IP connection. The symbolic
attributes must be transformed into numeric attributes to suit the SVDD
algorithm. Attribute scaling is also needed to ensure that the effect of some
attributes is not dwarfed by others that have larger scales. The details of
these data preprocessing methods are described in our previous paper [11].

Experiment 1 (Exp1) is designed to evaluate the change detection algorithm
for detecting intensive attacks. We take the SYN flood DoS attack as an example.
KDD’99 provides a typical 10 percent subset consisting of 494,020 instances,
most of which are attacks. We keep all of its 97,277 normal instances and filter
out most attacks to get a new set C1. In C1, 5 SYN flood attacks are kept, which
comprise more than 100,000 instances. Apart from the SYN flood attacks, all
other kinds of attacks amount to fewer than 900 instances in C1.

We first illustrate how the change index reflects the influence of SYN flood
attacks. Fig. 3 displays the change index values obtained on the C1 data flow,
where the sliding window size m is 3,000 and the window increment w is 3,000.
The SVDD parameter ν is 0.001 and the Gaussian kernel parameter q is 0.02.




Fig. 3. Change index curve generated in C1 (horizontal axis: time, in units of
3,000 connections). The top line gives the SYN flood attack schedule in C1: a
circle marks the beginning of an attack and a square marks its end. The
corresponding change index is shown by the bottom curve.



In Fig. 3, with the threshold A set to 0.53, all changes caused by SYN flood
attacks are correctly detected, with no false positives. Naturally, both the
start of a SYN flood and its withdrawal produce peaks in the change index curve.
The data falling within the two peaks should be excluded from the training
dataset. In fact, the change detection algorithm can be used not only to prepare
the training dataset but also as a detector of intensive DoS attacks, provided
that parameters such as w are set properly. When detecting these kinds of DoS
attacks, we are most concerned with detecting them as soon as possible so that
we can respond early and reduce the damage. With an increment of 3,000 for the
windows, the average alarm delay for the five SYN flood attacks is 342
connections. This means that, on average, we become aware of an attack within
its first 342 connections. We can use a smaller increment to get an earlier
alarm. In the set C1, when w is 500, the average alarm delay is 105 connections.
In theory a smaller w gives a shorter delay, but in practice a very small w is
impractical because of the computational cost of SVDD, even though an online
version of SVDD [12] is employed.

Experiment 2 is designed to test our adaptive anomaly detection system. The
experiment compares the performance of the static method with that of the
adaptive learning strategy. Building on Exp1, C1 is filtered to generate a new
dataset C2 in which attack instances make up about 1%.

To compare the adaptive manner with the static manner, the first 20,000 normal
records of C2 are extracted as an initial training dataset, and an initial model
is built from it. Exp2-1 tests the adaptive manner, which updates the model
periodically. In this mode, a retrain period for model training and update is
set. The initial model is used first; then, at the end of every period, a new
model is trained using the data collected in that period and replaces the old
model. In Exp2-1, the retrain period is set to 20,000 instances. Exp2-2 tests
the static manner, without updating the model. It uses only the initial model to
classify the rest of C2, and the model remains unchanged throughout the
detection process.



Table 1. Results of Exp2-1 and Exp2-2

                            Elapsed time (thousands of instances)
Experiment                  20        40        60        80 (all)

False positive rate (%)
  Exp2-1                    4.87      4.33      5.39      4.45
  Exp2-2                    4.87      6.43      9.52      13.43

Detection rate (%)
  Exp2-1                    96.33     92.66     88.76     89.27
  Exp2-2                    96.33     95.17     92.80     92.35


On the initial dataset, the parameters ν and q are selected through cross
validation to obtain the minimum false positive rate. We set ν = 0.01 and
q = 0.5, which gives a false positive rate of 1.06%. In Exp2-1, the two
parameters are left unchanged. Table 1 shows the detection rates and false
positive rates for Exp2-1 and Exp2-2 over elapsed time, i.e. as more instances
are seen. The static model in Exp2-2 detects 92.35% of the attacks in the
dataset C2 by the end of the data. However, its false positive rate increases
with time and reaches 13.43% at the end, which indicates the influence of
concept drift in C2. In contrast, the adaptive manner (Exp2-1) continuously
adapts to the concept drift and thus keeps the false positive rate low.
Consequently, it produces a significantly lower false positive rate (4.45% at
the end, against 13.43%) with a detection rate comparable to that of the static
model (89.27% against 92.35%).
4 Conclusion

Because previous adaptive systems are limited in application and difficult to
deploy, we present in this paper a new framework for adaptive anomaly detection
based on SVDD. To collect training data for model updates automatically, we
design a change detection algorithm that finds intensive attacks and filters
them from real time data. Detection models are then regenerated periodically
with the training data collected online. Our system significantly reduces human
intervention as well as deployment costs. Results of experiments with the
KDD’99 network data and preliminary analysis show that it can adapt to changes
in network behavior while still detecting attacks.
Abstract. In this paper, we improve the GSM (Global System for Mobile
Communications) authentication protocol to reduce the signaling loads on
the network. The proposed protocol introduces the notion of an enhanced
user profile containing a few VLR IDs for the location areas that a mobile
user is most likely to visit. We decrease the authentication costs for
roaming users by exploiting the enhanced user profile. Our protocol is
analyzed with regard to efficiency and compared with the original protocol.



1 Introduction 

GSM, a European standard for second generation mobile networks, intrinsically
provides three security functions [1][2]:

- Authentication of the subscriber’s identity
- Anonymity of the subscriber’s identity
- Confidentiality of data on the radio path

While providing the security functions listed above, GSM networks suffer
from excessive signaling loads for the transmission of authentication
parameters. The GSM authentication protocol therefore incurs significantly high
costs when a large number of communicating mobile users move frequently through
the location areas. Considering the tremendous growth in the number of mobile
users, this problem is becoming critical [2]. In this paper, we improve the GSM
authentication protocol to reduce the costs of roaming user authentication. The
basic concept of our protocol is to utilize an enhanced user profile containing
a group of VLR (Visitor Location Register) IDs. Among the VLRs, the master VLR
is defined as the VLR to which a mobile user performs location registration,
while a slave VLR is a VLR to which a mobile user performs location update.
That is, a mobile user moves from the location area covered by the master VLR
to other areas covered by slave VLRs. In our protocol, the master VLR manages
several slave VLRs to reduce the signaling traffic for authenticating roaming
users. This paper is organized as follows. Section 2 summarizes the operations
of the GSM authentication protocol and describes its drawbacks. Section 3
proposes an improved GSM authentication protocol that does not modify the
fundamentals of GSM systems. In Section 4, we show that our protocol improves
on the original protocol with regard to efficiency. Finally, Section 5
concludes this paper.



2 GSM Authentication Protocol 

The GSM authentication protocol uses a challenge-response mechanism based on a
secret key, which serves both mobile user authentication and session key
generation. In GSM systems, the user side stores the secret key in the SIM
(Subscriber Identity Module) card and the network stores it in a secure
database called the AuC (Authentication Center) [4]. The SIM is unique to each
mobile user and can be inserted into the mobile terminal. The SIM also contains
the service related information for each mobile user and a unique 128 bit key
Ki that is used to identify and authenticate the user to the network [5].



2.1 Original Protocol 

The detailed operations of the original protocol are summarized as follows [7][8].

1. On entering a new location area, a mobile user sends an authentication
request containing the TMSI (Temporary Mobile Subscriber Identity) and the
LAI (Location Area Identity) to the VLR.

2. The VLR checks the TMSI and derives the IMSI (International Mobile
Subscriber Identity) from it. Then the VLR forwards the IMSI to the HLR.

3. The HLR/AuC generates a 128 bit RAND for the received IMSI. It then derives
a 32 bit SRES (Signed RESult) and a 64 bit Kc with the A3 and A8 algorithms,
using the RAND and the private key Ki of the mobile user. The resulting
triplets (RAND, SRES, Kc) are returned to the VLR.

4. The VLR chooses the RAND from one of the triplets and forwards it to the
mobile terminal.

5. The mobile terminal generates SRES and Kc with the A3 and A8 algorithms,
using the received RAND and the Ki stored in the SIM. The session key Kc is
kept for secure communication and the SRES is sent to the VLR.

6. Finally, the VLR compares the received SRES with the SRES sent from the HLR.
If the two are equal, the mobile user is regarded as legitimate and the user
authentication is completed.
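
The challenge-response exchange in steps 3 to 6 can be sketched in Python as follows. The sketch makes one loud assumption: the actual A3 and A8 algorithms are operator-specific and are not given in this paper, so HMAC-SHA256 is used here purely as a stand-in that yields outputs of the right sizes (32 bit SRES, 64 bit Kc). All function names are hypothetical.

```python
import hashlib
import hmac
import os


def a3_a8_stand_in(ki, rand):
    """Stand-in for A3/A8 (NOT the real algorithms): derive SRES and Kc."""
    digest = hmac.new(ki, rand, hashlib.sha256).digest()
    sres = digest[:4]      # 32 bit signed result
    kc = digest[4:12]      # 64 bit session key
    return sres, kc


def hlr_make_triplet(ki, rand=None):
    """HLR/AuC side: build one authentication triplet (RAND, SRES, Kc)."""
    if rand is None:
        rand = os.urandom(16)          # 128 bit random challenge
    sres, kc = a3_a8_stand_in(ki, rand)
    return rand, sres, kc


def sim_respond(ki, rand):
    """Mobile terminal side: compute SRES and Kc from the received RAND."""
    return a3_a8_stand_in(ki, rand)


def vlr_verify(expected_sres, received_sres):
    """VLR side: compare the SRES from the HLR with the terminal's answer."""
    return hmac.compare_digest(expected_sres, received_sres)


ki = bytes(range(16))                                  # 128 bit subscriber key
rand, sres_hlr, kc_hlr = hlr_make_triplet(ki)
sres_ms, kc_ms = sim_respond(ki, rand)
authenticated = vlr_verify(sres_hlr, sres_ms)
```

Only RAND and SRES cross the radio path; Ki never leaves the SIM or the HLR/AuC, and both sides end up with the same Kc.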

2.2 Drawbacks of the GSM Authentication Protocol 

To perform user authentication, the VLR must contact the HLR, since the private
key of a mobile user is stored in the HLR. The problem is that the triplets
provided by the HLR are sent to the VLR via various intermediate links, which
incurs heavy signaling traffic in GSM networks. Moreover, the excessive stream
of signaling traffic may increase the authentication delay. As we will see, a
large portion of GSM network traffic is generated by the constant signaling
between mobile users and the network [5][9]. Fig. 1 depicts the flow of
signaling messages generated while a mobile user performs either location
registration or location update.




Fig. 1. GSM authentication protocol for roaming users (panels: location
registration, followed by two location updates)



Since the authentication procedure during location registration was explained
in Section 2.1, we do not repeat it here. Instead, we illustrate the
authentication procedure during location update. A mobile user first performs
location registration to VLR1. As a result of successful authentication, VLR1
allocates TMSI1 to the mobile user. We assume that the mobile user subsequently
moves to the area covered by VLR2. Then the following authentication steps are
performed.

2-1 The mobile user sends TMSI1 to VLR2.

2-2 VLR2 forwards TMSI1 to the location area managed by VLR1.

2-3 VLR2 receives an IMSI, together with a few authentication triplets, from
VLR1.

2-4 VLR2 chooses the RAND from one of the authentication triplets and sends it
to the mobile terminal.

2-5 The mobile terminal calculates an SRES and sends it back to VLR2. VLR2
then performs the user authentication.

2-6 After the user authentication is completed successfully, VLR2 sends a
location update message to the HLR and updates the location of the mobile user.

2-7 The HLR returns an acknowledgement for the user’s location update to VLR2.

2-8 VLR2 assigns TMSI2 to the mobile terminal.

2-9 Finally, the HLR transmits a TMSI1 cancellation message to VLR1.

If the mobile user moves to the location area managed by VLR3, similar
authentication steps are performed.
3 Proposed Protocol for GSM Roaming Users

3.1 Basic Idea 

To reduce the signaling traffic for roaming user authentication, our protocol
exploits the enhanced user profile. The enhanced user profile is maintained in
the HLR and the mobile terminal. The location areas corresponding to the VLR
IDs in the enhanced user profile can be chosen by mobile users when they enroll
with the service provider, and can later be modified according to the users’
preferences. In our protocol, the HLR thus knows in advance the location areas
that each mobile user is most likely to visit. In addition, the master VLR
manages a group of slave VLRs. In detail, the master VLR transmits
authentication triplets to the slave VLRs and keeps track of the mobile users’
movement within the areas specified in the enhanced user profile. Thus, our
protocol does not require a location update to the HLR as long as a mobile user
roams within the group of areas indicated by the enhanced user profile.
Instead, the location update of the mobile user is sent from the slave VLR to
the master VLR. For this purpose, the TMSIs assigned in the areas of the
enhanced user profile must contain the ID of the master VLR. As a result, the
slave VLRs can identify the master VLR and notify it of the mobile user’s
location change.

Our protocol must be installed on the HLR and the VLRs. The HLR identifies the
master VLR by checking the VLR ID in the enhanced user profile when a mobile
user performs location registration. The HLR regards the other VLRs in the
enhanced user profile as slave VLRs and delegates its role to the master VLR.
In addition to the fields for the VLR IDs in the enhanced user profile, we
consider another field for the case where a mobile user moves to an area that
is not indicated by the enhanced user profile. In this case, the VLR ID of the
corresponding area is inserted into this field and the original protocol is
performed. Specifically, the current VLR receives the authentication triplets
from the master VLR and performs the authentication procedure. When the user
authentication is completed successfully, the VLR transmits a location update
directly to the HLR to keep itself and the HLR consistent.
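
One possible way to represent the enhanced user profile and the resulting routing of location updates is sketched below in Python. The class and field names are hypothetical, and the sketch assumes that a return to the master VLR's own area is also handled by the master, a case the text does not spell out.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class EnhancedUserProfile:
    master_vlr: str                                 # VLR of the location registration
    slave_vlrs: Set[str] = field(default_factory=set)
    outside_vlr: Optional[str] = None               # extra field: area not in the profile


def location_update_target(profile, new_vlr):
    """Return the entity that receives the location update for new_vlr."""
    if new_vlr == profile.master_vlr or new_vlr in profile.slave_vlrs:
        # Roaming inside the profile: the master VLR is notified, not the HLR.
        profile.outside_vlr = None
        return profile.master_vlr
    # Outside the profile: record the VLR ID and fall back to the HLR.
    profile.outside_vlr = new_vlr
    return "HLR"


profile = EnhancedUserProfile(master_vlr="VLR1", slave_vlrs={"VLR2", "VLR3"})
inside_target = location_update_target(profile, "VLR2")
outside_target = location_update_target(profile, "VLR9")
```

In this sketch the HLR is contacted only when the user leaves the group of areas listed in the profile.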

3.2 Protocol Description 

Fig. 2 illustrates the functional steps of our protocol and describes the flow
of signaling messages while a mobile user crosses several areas specified in
the enhanced user profile. The steps of main interest are those of the
authentication procedure during location update. As presented in Fig. 2, our
protocol uses an enhanced user profile containing three VLR IDs, i.e., VLR1,
VLR2 and VLR3. The noticeable difference between the original protocol and our
protocol is that VLR1 acts as the master VLR and VLR2 and VLR3 act as slave
VLRs. When a mobile user performs location registration to VLR1, the original
protocol is performed; this is depicted in steps 1-1 through 1-8. If the mobile
user moves to the area covered by VLR2, the following authentication steps are
performed.
Fig. 2. Proposed GSM authentication protocol for roaming users (panels:
location registration, followed by two location updates)

2-1 The mobile user sends TMSI1 to VLR2.

2-2 VLR2 forwards TMSI1 to the area where the mobile user performed location
registration, i.e., the area managed by VLR1.

2-3 VLR2 receives an IMSI along with the authentication triplets from VLR1.

2-4 VLR2 chooses the RAND from one of the triplets and sends it to the mobile
terminal.

2-5 The mobile terminal calculates an SRES and returns it to VLR2. The SRES
is used to check the validity of the mobile user.

2-6 After the user authentication is completed successfully, VLR2 sends a
location change notification to VLR1.

2-7 VLR1 forwards an Ack to VLR2.

2-8 VLR2 allocates a new TMSI, i.e., TMSI2, to the mobile terminal.

2-9 VLR1 sends a TMSI1 cancellation message to itself, i.e., the area that
previously allocated a TMSI to the mobile terminal.
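
The difference between the two location update flows can be checked by counting messages per link type. The Python sketch below transcribes steps 2-1 to 2-9 of the original protocol (Section 2.2) and of the proposed protocol as sender and receiver pairs; the labels `MS`, `VLR-MS`, `HLR-VLR` and `VLR-VLR` and the function names are chosen here for illustration, and the cancellation that VLR1 sends to itself is counted as a VLR-VLR message.

```python
from collections import Counter

# (sender, receiver) for steps 2-1 .. 2-9, as listed in the text.
ORIGINAL_UPDATE = [
    ("MS", "VLR2"), ("VLR2", "VLR1"), ("VLR1", "VLR2"), ("VLR2", "MS"),
    ("MS", "VLR2"), ("VLR2", "HLR"), ("HLR", "VLR2"), ("VLR2", "MS"),
    ("HLR", "VLR1"),
]
PROPOSED_UPDATE = [
    ("MS", "VLR2"), ("VLR2", "VLR1"), ("VLR1", "VLR2"), ("VLR2", "MS"),
    ("MS", "VLR2"), ("VLR2", "VLR1"), ("VLR1", "VLR2"), ("VLR2", "MS"),
    ("VLR1", "VLR1"),
]


def link_type(sender, receiver):
    """Classify a message by the kind of link it travels over."""
    ends = {sender, receiver}
    if "MS" in ends:
        return "VLR-MS"
    if "HLR" in ends:
        return "HLR-VLR"
    return "VLR-VLR"


def count_links(flow):
    return Counter(link_type(s, r) for s, r in flow)


original_counts = count_links(ORIGINAL_UPDATE)
proposed_counts = count_links(PROPOSED_UPDATE)
```

In the original flow three of the nine messages involve the HLR, whereas the proposed flow involves none; the number of messages on the radio link is the same in both.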

When the mobile user moves to the area managed by VLR3, similar authentication
steps are performed; these are presented in steps 3-1 through 3-9. In summary,
the proposed protocol can track a mobile user across the location areas that
the user is most likely to visit. Basically, our protocol exploits the local
movement of mobile users, i.e., local mobility. Specifically, a large number of
mobile users can be regarded as commuters roaming within a few limited areas
including home, office, school, etc. This stems from the fact that the mobility
pattern of most roaming users tends to be quite routine and their roaming
coverage tends to be confined to a few areas. By using this localized feature
of user roaming, our protocol does not rely on the HLR for every step of user
authentication. Thus, the proposed protocol can perform roaming user
authentication efficiently.
4 Performance Analysis

We regard the signaling load as the most essential criterion for evaluating the
performance of an authentication protocol. Focusing on this criterion, we
perform a few simulations and compare the performance of our protocol with that
of the original protocol. For this purpose, we use the fluid flow mobility
model [2].



4.1 Fluid Flow Mobility Model 

The fluid flow mobility model assumes the following parameters. Using these
parameters, we can obtain numerical results for the signaling traffic during
roaming user authentication [2][11].

- Average speed of mobile users: v = 6.3 km/h
- Average density of mobile users: ρ = 267/km²
- Moving direction of mobile users: [0, 2π]
- Border length of a location area: l = 8.65 km
- Total border length of a location area: L = 34.6 km
- One HLR for 64 location areas, each controlled by one VLR



First, we compute the rate of location registrations at a VLR:

$$R_{Reg,VLR} = \frac{\rho \, v \, L}{\pi} = \frac{267 \times 6.3 \times 34.6}{3600\pi} = 5.14/\mathrm{s} \qquad (1)$$

We can also derive the rate of location registrations generated at the HLR:

$$R_{Reg,HLR} = R_{Reg,VLR} \times \text{Number of Areas} = 5.14/\mathrm{s} \times 64 = 328.96/\mathrm{s} \qquad (2)$$
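
The arithmetic can be reproduced with a few lines of Python. The function name is hypothetical; the factor 3600 converts the speed from km/h to km/s. Note that the figure of 328.96/s is obtained by multiplying the truncated value 5.14 by 64; carrying the unrounded rate through gives a slightly larger value of roughly 329.3/s.

```python
import math


def registration_rate_per_vlr(density_per_km2, speed_km_per_h, border_km):
    """Fluid flow boundary crossing rate, in registrations per second."""
    per_hour = density_per_km2 * speed_km_per_h * border_km / math.pi
    return per_hour / 3600.0


rate_vlr = registration_rate_per_vlr(267, 6.3, 34.6)   # about 5.146 per second
rate_hlr = rate_vlr * 64                                # 64 location areas per HLR
```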



4.2 Numerical Results for the Authentication Signaling Loads 

We define the following parameters to calculate the costs of roaming user
authentication:

$TC_{HV}$ : Transmission cost between HLR and VLR
$TC_{VV}$ : Transmission cost between VLR and VLR
$TC_{HM}$ : Transmission cost between HLR and master VLR
$TC_{MS}$ : Transmission cost between master VLR and slave VLR
$TC_{VM}$ : Transmission cost between VLR and mobile terminal
$PC_H$, $PC_V$ : Processing cost at HLR, VLR
$PC_M$, $PC_S$ : Processing cost at master VLR, slave VLR



According to the signaling flows described in Fig. 1 and 2, the authentication 
costs for the original protocol and our protocol can be derived as: 



ACoriginal_reg = ATCvm + ^TChv + ^PCv + ^PCh (3) 

ACproposed_reg = ^TCvm + ^TChm + 3PCv + 2PCh (4) 

ACoriginal_up = ATCvm + ^TChv + 2TCvv + 6PCv + PCh (5) 

ACproposed_up = ^TCvm + 4TCms + TCvv + 4PCs + 2PCm (6) 

We assume that the transmission cost is proportional to the distance between 
the network entities. We also assume that the transmission cost over a wireless 
link is much higher than the transmission cost over a wired link. The additional 
parameters are as follows. 

- p : Residential probability in the areas indicated by the enhanced user profile 
- r : Registration ratio to the location areas 

Using the aforementioned parameters, we present a few simulation results. 
Fig. 3 shows the authentication signaling loads for varying residential 
probabilities. The simulation results are obtained for a time interval of 100 s. 
As shown in Fig. 3, the advantage of our protocol over the original protocol 
grows as p increases. When p is 0, the two protocols perform equally, since in 
this case the mobile users move through areas that are not specified in the 
enhanced user profile. The performance gap between the two protocols reaches 
its maximum when p is 1, because the mobile users then roam only within the 
areas indicated by the enhanced user profile. 





Fig. 3. Results with varying p where Registration Ratio = 10%, 30% 



Fig. 4 shows the results for varying registration ratios. Similarly, the 
advantage of our protocol over the original protocol grows as r decreases: as 
r decreases, the authentication cost during location updates adds increasingly 
to the total signaling load. From all the simulation results, we conclude that 
our protocol generates less signaling traffic than the original protocol when 
authenticating roaming users. 

Fig. 4. Results with varying Registration Ratio where p = 100%, 60% 

5 Conclusion and Future Work 

In this paper, we designed and analyzed an improved authentication protocol for 
GSM roaming users. Our protocol aims at reducing the excessive signaling traffic 
of roaming user authentication. To this end, we exploit the enhanced user 
profile and utilize the localized features of users’ mobility. Our performance 
evaluation shows that our protocol avoids the inefficiency of the original 
protocol, in which every step of the roaming user authentication procedure must 
rely on the HLR. Our protocol does not substantially modify the fundamentals of 
GSM systems. Thus, it can satisfy the security requirements of the GSM 
authentication protocol. Specifically, it maintains a level of security similar 
to that of the original protocol. 
Abstract. Detecting new intrusion attacks is an important issue in network 
security. This paper investigates unsupervised intrusion detection. We propose 
a distance definition for mixed attributes, a simple method for calculating the 
cluster radius threshold, an outlier factor that measures the degree of 
deviation of a cluster, and a novel intrusion detection method. The experimental 
results show that the method achieves promising performance, with a high 
detection rate and a low false alarm rate, and can also detect new intrusions. 



1. Introduction 

Signature-based detection methods and supervised anomaly detection methods 
can only detect previously known intrusions; at the same time, the signature 
database and the labeled data have to be processed manually. To overcome these 
flaws, unsupervised anomaly detection methods have been studied recently [1-4]. 
However, existing unsupervised methods have some problems: (1) They cannot deal 
with categorical attributes, or they deal with categorical attributes in an 
overly complicated way. (2) The detection results are sensitive to the 
parameter, and it is difficult to select the parameter. (3) It is not reasonable 
to label all objects in small clusters as anomalous. This paper is mainly 
concerned with these problems. 



2. Notation and Definition 

Suppose dataset D is described by m attributes (m_C categorical and m_N 
continuous), with the categorical attributes preceding the continuous 
attributes, and let D_i be the set of values of the i-th attribute. 

Definition 1: Given a cluster C and a_i ∈ D_i, the support of a_i in C with 
respect to D_i is defined as 
$Sup_{C|D_i}(a_i) = |\{object \mid object \in C,\ object.D_i = a_i\}|$. 

Definition 2: Given a cluster C, the cluster summary information (CSI) for C is 
defined as CSI = {kind, n, Summary}, where kind is the type of the cluster C 
(‘normal’ or ‘attack’), n is the size of the cluster C, and Summary describes 
the frequency information of the categorical attribute values and the centroid 
of the numerical attributes: 

$$Summary = \{\langle Stat_i, Cen \rangle \mid Stat_i = \{(a_i, Sup_{C|D_i}(a_i)) \mid a_i \in D_i\},\ 1 \le i \le m_C,\ Cen = (c_{m_C+1}, c_{m_C+2}, \ldots, c_{m_C+m_N})\}$$

H. Jin et al. (Eds.): NPC 2004, LNCS 3222, pp. 459-462, 2004. 
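Definition 3: Given clusters C, C_1, C_2 and objects p = {p_i | i ∈ [1, m]}, 
q = {q_i | i ∈ [1, m]}: 

(1) The distance between object p and cluster C is defined as 
$d(p, C) = (d_1 + d_2)/m$, where c_i denotes the i-th component of the centroid 
of C and 

$$d_1 = m_C - \sum_{i=1}^{m_C} \frac{Sup_{C|D_i}(p_i)}{|C|}, \qquad d_2 = \sqrt{\sum_{i=m_C+1}^{m} |p_i - c_i|^2}$$

(2) The distance between clusters C_1 and C_2 is defined as 
$d(C_1, C_2) = (d_1 + d_2)/m$, where $c_i^{(1)}$ and $c_i^{(2)}$ denote the i-th 
centroid components of C_1 and C_2 and 

$$d_1 = m_C - \frac{1}{|C_1| \cdot |C_2|} \sum_{i=1}^{m_C} \sum_{p_i \in C_1} Sup_{C_1|D_i}(p_i) \cdot Sup_{C_2|D_i}(p_i) = m_C - \frac{1}{|C_1| \cdot |C_2|} \sum_{i=1}^{m_C} \sum_{q_i \in C_2} Sup_{C_1|D_i}(q_i) \cdot Sup_{C_2|D_i}(q_i)$$

$$d_2 = \sqrt{\sum_{i=m_C+1}^{m} |c_i^{(1)} - c_i^{(2)}|^2}$$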



Definition 4: Let C = {C_1, C_2, ..., C_k} be the result of clustering on 
training data D. The outlier factor of cluster C_i is defined as the harmonic 
mean of the distances between cluster C_i and the other clusters: 

$$OF(C_i) = \frac{k - 1}{\sum_{j \ne i} \frac{1}{d(C_i, C_j)}}$$






3. The Clustering-Based Intrusion Detecting Method 

3.1 Clustering 

We use the least distance principle to cluster the dataset into hyperspheres 
with almost the same radius [3]. The details of the clustering are as follows. 

Step 1: Initialize the set of clusters, S, to the empty set, and read a new 
object p. 
Step 2: Create a cluster with the object p and add it to S. 
Step 3: If no objects are left in the database, go to Step 6; otherwise read a 
new object p and find the cluster C in S that is closest to p. In other 
words, find a cluster C in S such that d(p, C) ≤ d(p, C') for all C' in S. 
Step 4: If d(p, C) > r, go to Step 2, where r is the threshold. 
Step 5: Otherwise, merge object p into cluster C, modify the CSI of cluster C, 
and go to Step 3. 
Step 6: Stop. 
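The steps above can be sketched as a single pass over the data. This is a minimal illustration, not the authors' implementation: it assumes purely numeric objects, and the hypothetical `distance` function uses the Euclidean distance to the cluster centroid as a stand-in for d(p, C) from Definition 3. The `Cluster` class and `one_pass_cluster` are likewise illustrative names.

```python
import math

class Cluster:
    """Minimal cluster summary: size and running centroid (numeric attributes only)."""
    def __init__(self, point):
        self.n = 1
        self.centroid = list(point)

    def add(self, point):
        # Incrementally update the centroid with the new point (Step 5).
        self.n += 1
        self.centroid = [c + (x - c) / self.n for c, x in zip(self.centroid, point)]

def distance(point, cluster):
    """Stand-in for d(p, C): Euclidean distance to the centroid (an assumption)."""
    return math.dist(point, cluster.centroid)

def one_pass_cluster(points, r):
    clusters = []                                   # Step 1: S starts empty
    for p in points:                                # Step 3: read the next object
        if clusters:
            nearest = min(clusters, key=lambda c: distance(p, c))
            if distance(p, nearest) <= r:           # Steps 4-5: merge if within r
                nearest.add(p)
                continue
        clusters.append(Cluster(p))                 # Step 2: open a new cluster
    return clusters                                 # Step 6: stop

data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 0.1), (5.1, 5.0)]
clusters = one_pass_cluster(data, r=1.0)
print([(c.n, c.centroid) for c in clusters])
```

Because each object is compared only with the current cluster summaries, the data are read once; the result depends on the order in which objects arrive.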

3.2 The Intrusion Detection Method 

Our intrusion detection method is composed of a modeling module and a detecting 
module. 

(1) Setting up the model 

Step 1, Clustering: Cluster the training set T to produce clusters 
C = {C_1, C_2, ..., C_k}. 

Step 2, Labeling clusters: Sort the clusters C = {C_1, C_2, ..., C_k} so that 
OF(C_1) ≤ OF(C_2) ≤ ... ≤ OF(C_k). Find the smallest b that satisfies 
$\sum_{i=1}^{b} |C_i| / |T| \ge \varepsilon$, and then label clusters 
C_1, C_2, ..., C_b as ‘normal’ and clusters C_{b+1}, ..., C_k as ‘attack’. 

Step 3, Producing the model: The model is made up of the cluster summary 
information and the radius threshold r. 

(2) Detecting attacks 

For any object p in the testing set, find the cluster C_0 that is closest to p. 
If d(p, C_0) ≤ r, classify p by the label of C_0; otherwise regard p as a new 
attack. 
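The labeling and detection steps can be illustrated as follows. This is a sketch under stated assumptions: the pairwise cluster distances are supplied as a precomputed matrix `dist`, cluster sizes as a list, and all function names are hypothetical. The outlier factor follows the harmonic-mean description in Definition 4.

```python
def outlier_factor(i, dist):
    """Harmonic mean of the distances from cluster i to every other cluster."""
    k = len(dist)
    return (k - 1) / sum(1.0 / dist[i][j] for j in range(k) if j != i)

def label_clusters(sizes, dist, eps):
    """Label clusters in ascending outlier-factor order as 'normal' until
    they cover at least a fraction eps of the training set; the rest are 'attack'."""
    order = sorted(range(len(sizes)), key=lambda i: outlier_factor(i, dist))
    total = sum(sizes)
    labels = ["attack"] * len(sizes)
    covered = 0
    for i in order:
        if covered / total >= eps:
            break
        labels[i] = "normal"
        covered += sizes[i]
    return labels

def classify(dists_to_clusters, labels, r):
    """dists_to_clusters[i] is d(p, C_i) for the object p being tested."""
    nearest = min(range(len(labels)), key=lambda i: dists_to_clusters[i])
    if dists_to_clusters[nearest] <= r:
        return labels[nearest]
    return "new attack"

# Toy example: three clusters with sizes 90, 6, 4 and symmetric distances.
sizes = [90, 6, 4]
dist = [[0.0, 1.0, 5.0],
        [1.0, 0.0, 4.0],
        [5.0, 4.0, 0.0]]
labels = label_clusters(sizes, dist, eps=0.95)
print(labels)
print(classify([3.0, 2.0, 6.0], labels, r=0.5))
```

In the toy example the third cluster lies far from the other two, so it has the largest outlier factor and is the one left labeled as an attack.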



3.3 Tuning Parameters 

(1) Selecting threshold r 

According to the clustering process, the threshold r should be greater than the 
intra-cluster distance and less than the inter-cluster distance. We therefore 
expect r to be close to the average distance between pairs of objects. The 
details are as follows: 

① Randomly choose N_0 pairs of objects in the dataset D. 
② Compute the distance between the objects of each pair. 
③ Compute the average EX and the standard deviation DX of the distances from ②. 
④ Select r in the range [EX - 0.25DX, EX]. 
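The sampling procedure above can be sketched in a few lines. The sketch assumes numeric objects with Euclidean distance and uses the population standard deviation; the text does not say which distance implementation or which form of the standard deviation the authors used, and both function names are hypothetical.

```python
import math
import random
import statistics

def threshold_range(ex, dx):
    """Candidate interval [EX - 0.25*DX, EX] for the radius threshold r."""
    return ex - 0.25 * dx, ex

def estimate_ex_dx(data, n_pairs, dist=math.dist, seed=0):
    """Sample n_pairs random pairs and return (mean, std) of their distances."""
    rng = random.Random(seed)
    distances = [dist(*rng.sample(data, 2)) for _ in range(n_pairs)]
    return statistics.mean(distances), statistics.pstdev(distances)

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (4.0, 4.0), (4.0, 5.0)]
ex, dx = estimate_ex_dx(points, n_pairs=50)
low, high = threshold_range(ex, dx)
print(f"choose r in [{low:.3f}, {high:.3f}]")
```

With the values reported in Section 4 (EX = 0.063, DX = 0.043), `threshold_range` gives the interval [0.05225, 0.063], whose endpoints correspond to the r = 0.052 and r = 0.063 columns of Table 1.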

(2) Selecting parameter ε 

1 - ε is the approximate ratio of outliers to the whole dataset. A rule of thumb 
in statistics is that the proportion of contaminated data in a dataset is 
usually less than 5% and almost always less than 15%, so we generally let ε be 
about 0.95. If we have prior knowledge of this ratio, we can select ε more 
accurately. 



4. Experimental Results 

The 10% subset of KDDCUP99 [6] is used to evaluate our algorithm. We divide the 
subset into two subsets, P1 and P2. P1 contains 40459 records (96% normal). P2 
contains some attack types that are unknown in P1. We build the model on the 
training set P1 and test it on the testing set P2. By computation, EX = 0.063 
and DX = 0.043; we let ε = 0.95. Table 1 shows the detection results for 
different values of r. Table 2 compares the results of different methods on the 
KDDCUP99 dataset. 



Table 1 Detection result with distinct r 

                                    r=0.031   r=0.042   r=0.052   r=0.063   r=0.073   r=0.084 
Total detection rate                98.79%    98.53%    98.47%    93.33%    93.18%    27.69% 
False alarm rate                    1.24%     0.12%     0.40%     1.37%     1.36%     0.43% 
Detection rate for unknown attack   37.40%    33.60%    33.56%    58.92%    57.81%    21.24% 
Table 2 The contrast of results with different methods on dataset KDDCUP99 

Ref          Detection rate   False alarm rate   Detection rate for unknown attack 
[1]          55%-82%          0.8%-4.9%          / 
[2]          43.1%-75.2%      /                  / 
[3]          35.7%-88%        1.44%-8.14%        / 
[4]          28%-93%          0.5%-10%           / 
[5]          91.8%            0.5%               / 
Our method   27.69%-98.79%    0.4%-1.37%         21.24%-58.92% 



5. Conclusion 

In practice, unsupervised detection methods are important because they can be 
applied to raw collected system data, which does not need to be manually 
labeled, an otherwise expensive process. In this paper, we presented a new 
unsupervised intrusion detection method. The method needs neither a prior 
classification of the training data nor knowledge about new attacks. The 
experimental results show that our method outperforms the existing methods in 
accuracy. 
Abstract. Remote mirroring ensures that all data written to a primary 
storage device are also written to a remote secondary storage device to 
support disaster recoverability. In this study, we designed and imple- 
mented a storage-based synchronous remote mirroring for SAN-attached 
storage nodes. Taking advantage of the high bandwidth and long-distance 
linking ability of dedicated fiber connections, this approach provides a 
consistent and up-to-date copy in a remote location to meet the demand 
for disaster recovery. This system has no host or application overhead, 
and it is also independent of the actual storage unit. In addition, we 
present a disk failover solution. The performance results indicate that 
the bandwidth of the storage node with mirroring under a heavy load 
was 98.67% of the bandwidth without mirroring, which was only a slight 
performance loss. This means that our synchronous remote mirroring has 
little impact on the host’s average response time and the actual band- 
width of the storage node. 



1 Introduction 

Remote mirroring ensures that all data written to a primary storage are also 
written to a remote secondary storage to support disaster recoverability. It can 
be implemented at various levels, including the file system, the volume manager, 
the driver, the host bus adapter (HBA) and the storage control unit [1] [14]. 
In general, there are two locations at which mirroring is implemented: the 
storage control unit and the host. Each location has its own advantages and 
disadvantages. 

IBM’s Peer-to-Peer Remote Copy (PPRC) [1] and EMC’s Symmetrix Remote 
Data Facility (SRDF) [2] use a synchronous protocol at the level of the storage 
control unit. Today’s storage control units contain general-purpose processors, 
in addition to special-purpose elements for moving and computing blocks of data. 
Therefore, remote mirroring is provided by storage subsystems as an advanced 
copy function. However, this approach is costly and depends on the actual disk 
storage subsystem. Veritas’s Volume Replicator [3] is a remote mirroring 
solution at the level of the host’s device driver. It intercepts write 
operations at the host device driver level and sends the changes to a remote 
device, so it is independent of the actual storage unit. However, it takes a 
toll on host CPU cycles and communication bandwidth, and it is difficult to 
manage because it needs to interact with all the hosts where replication 
software is installed. NetApp’s SnapMirror [4] is a remote mirroring solution 
using an asynchronous protocol at the level of the host’s file system, but it 
is not suitable for block-level I/O access in the SAN environment. 

In this study, we designed and implemented synchronous remote mirroring 
at the level of the storage control unit for Tsinghua Mass Storage Network 
System (TH-MSNS) [5] [6], an implementation of the FC-SAN. Based on the high 
bandwidth and long-distance linking ability of dedicated fiber connections, this 
approach provides a consistent and up-to-date data copy in a remote location to 
meet the demand for disaster recovery. This implementation of remote mirroring 
has no host or application overhead, and it is also independent of the actual 
storage unit. In addition, we present a failover solution for disk failure. The 
performance results indicate that our synchronous remote mirroring does not 
have a significant effect on the average command response time or on the actual 
bandwidth of the storage node. 



2 Introduction of the TH-MSNS 

The TH-MSNS[5] [6] is an implementation of an FC-SAN. In the TH-MSNS, 
the storage nodes provide storage services. A storage node is composed of a 
general-purpose server, SCSI disk arrays, and fibre channel adapters, and it has 
a software module named the SCSI target simulator [7] [8] running on it. By 
using the SCSI target simulator to control the I/O process to access disk arrays, 
the TH-MSNS can implement basic functions of FC disk arrays while only using 
general SCSI disk arrays. Because of this, it is low cost, highly flexible and can 
achieve considerable performance [5]. Figure 1 shows the I/O path of the 
TH-MSNS. The file system and SCSI driver in the host convert the application’s 
I/O requests to SCSI commands and data, and then the FC HBA driver encapsulates 




Fig. 1. The I/O path of the TH-MSNS 




An Implementation of Storage-Based Synchronous Remote Mirroring 465 



them into FC frames using Fibre Channel Protocol (FCP)[9] and sends them to 
the SAN. When the FC HBA on the storage node receives the frames, the FC 
target driver transforms them back into SCSI commands and data. Then the 
SCSI target simulator, a kernel module running on the storage node, queues 
and fills in the SCSI request structures, and finally calls the API of the 
SCSI driver layer to commit the SCSI requests to the SCSI subsystem. After the 
SCSI subsystem has completed the SCSI commands, the SCSI target simulator 
returns the command status or data to the host. Therefore, through the 
coordination of the SCSI target simulator and the FC target driver, the SCSI 
disk arrays of the storage node can be directly mapped to the host as its own 
local disks. Thus, in terms of the basic storage service, the storage node is 
equivalent to the storage control unit in the SAN environment. 

3 Design and Implementation of Remote Mirroring for 
the TH-MSNS 

3.1 The Architecture of Remote Mirroring 

Figure 2 shows the architecture of remote mirroring for the TH-MSNS. We 
added a remote storage node with the same structure and configuration as the 




Fig. 2. The architecture of remote mirroring for TH-MSNS 



local storage node. By adding an FC HBA in the local storage node to connect 
point-to-point with the remote storage node, the remote storage node’s disks 
are regarded as the local storage node’s own disks. Therefore, the local 
storage node and the remote storage node can constitute a mirrored pair. The 
write commands from the host can be mirrored to the remote storage node by the 
SCSI target simulator on the local storage node, so the data on the local 
storage node is mirrored to the remote storage node. The remote storage node 
can be located up to 10 km away from the local storage node using fibre channel 
technology, and this distance can be extended further by using extenders. 
The advantages of this approach to remote mirroring are as follows. 

• It has low cost and high flexibility because the remote mirroring is 
implemented by software modules, not by special hardware. 

• It is independent of the actual storage unit because the host’s SCSI 
commands are redirected to the storage node’s SCSI driver layer, which 
provides a common API for I/O access and hides the details of the low-level 
storage subsystem. 

• The mirroring process is transparent to the host and does not consume any 
of the host’s resources. Moreover, the remote storage node is also 
transparent to the host, because the SCSI target simulator can prevent the 
remote disks from being mapped by the host. 

3.2 Synchronous Mirroring 

Synchronous remote mirroring writes data not only to a local disk but also to a 
remote mirror disk at the same time. The acknowledgement is not sent back 
until all the data is written to both disks. In many cases, both copies of the data 
are also locally protected by RAID [10]. This approach provides a consistent and 
up-to-date copy in a remote location to meet the demand for disaster recovery. 

In the mirroring architecture we presented, the SCSI target simulator on 
the local storage node receives SCSI commands and data from the FC target 
driver. It then converts each write command into a pair of write commands to 
the mirrored disks, queues them into different request queues, and finally 
calls the SCSI driver to process them. The local write command is sent to the 
local disk, and the remote write command is sent to the ‘network’ disk mapped 
from the remote storage node. The remote write command is received by the SCSI 
target simulator on the remote storage node through the point-to-point fiber 
connection. The acknowledgement is not sent back to the host until both the 
local and remote write commands have been completed. Figure 3 shows the local 
and remote I/O paths of synchronous mirroring. The dashed line represents the 
I/O path of the remote write commands, and the solid line represents the I/O 
path of the local read/write commands. To reduce the command processing time, 
the local and remote write commands share the same data buffer: no separate 
data buffer needs to be allocated for the remote write command, whose data 
pointer simply points to the data buffer of the local write command. The local 
and remote write commands can be committed to the SCSI driver at almost the 
same time and can be executed by different HBAs concurrently. In this way, 
commands can be executed more efficiently. 
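The write path described above (one host write becomes a local and a remote write that share a buffer, and the acknowledgement waits for both) can be illustrated with a user-space sketch. This is not the kernel-level SCSI target simulator: the `Disk` class is a toy in-memory device, the two threads stand in for the two HBAs, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

class Disk:
    """Toy in-memory block device; stands in for a SCSI disk."""
    def __init__(self):
        self.blocks = {}

    def write(self, lba, buf):
        # Copy out of the shared buffer only when the block is "written to media".
        self.blocks[lba] = bytes(buf)
        return True

def mirrored_write(local, remote, lba, data, pool):
    """Issue the local and remote writes concurrently and acknowledge
    only after both have completed."""
    buf = memoryview(data)  # one data buffer, referenced by both commands
    futures = [pool.submit(local.write, lba, buf),
               pool.submit(remote.write, lba, buf)]
    results = [f.result() for f in futures]  # wait for both completions
    return all(results)

with ThreadPoolExecutor(max_workers=2) as pool:
    local, remote = Disk(), Disk()
    ack = mirrored_write(local, remote, lba=0, data=b"payload", pool=pool)

print(ack, local.blocks == remote.blocks)
```

Sharing a `memoryview` rather than copying the payload mirrors the idea of pointing the remote command's data pointer at the local command's buffer.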

There are two device chain structures in the SCSI target simulator: the local 
disk chain and the remote disk chain. Figure 4 shows the local and remote disk 
chains. The structure of a local disk in the local disk chain contains a pointer 
to the structure of its mirrored disk. A mirror relationship can also be 
created between two local disks, as in software RAID 1. When a SCSI command 
arrives, the SCSI target simulator determines the SCSI command’s target disk 
and finds the remote mirrored disk through the pointer mentioned above. Then 
the SCSI target simulator fills in the mirrored write request using the 
information of the mirrored disk, such as the number of the SCSI host bus, the 
target, and the LUN. Furthermore, the SCSI target simulator maps only the local 
disk to the host, so the remote disk is invisible to the host. 

Fig. 3. The I/O path of remote mirroring for the TH-MSNS 

Fig. 4. The local and remote disk chains 



3.3 Disk Fail Over 

Although both the local and remote disks are locally protected by RAID, many 
circumstances can cause a SCSI command to fail. Examples include a power 
failure of the disk array, an unplugged SCSI cable, a break in the fiber 
connection, or a failure of the RAID. Because of this, it is necessary to 
monitor all write commands, both local and remote. When a command returns with 
an abnormal status, action must be taken immediately to ensure the continuity 
of service. 

For the failures mentioned above, SCSI commands will time out. The SCSI 
driver layer will try to recover or retry the command. If these actions fail, 
the SCSI driver will return the SCSI command with the timeout status to the 
SCSI target simulator. The SCSI target simulator analyzes the status of the 
local and remote commands and takes different measures in different cases. 
• The remote write command fails, but the local command succeeds. In this 
case, the remote disk is assumed to be defunct. The mirror relationship must 
be broken, and the corresponding information must be recorded for later 
resynchronization. 

• The local write command fails, but the remote command succeeds. In this 
case, the local disk is assumed to be defunct, and the mirror relationship 
must be broken. The corresponding information must be recorded for later 
resynchronization. The most important action is to redirect the read and 
write commands for the local disk to the remote mirrored disk. 

• The local read command fails. In this case, the local disk is assumed to be 
defunct. The mirror relationship is broken, and the corresponding information 
is recorded for later resynchronization. The most important action is to 
redirect the read and write commands for the local disk to the remote 
mirrored disk. 

To perform the disk failover, the local disk’s structure contains a status 
identifier that records the status of the local disk. Table 1 shows the possible 
statuses of the local disks. In addition, some key remote mirroring 
implementation techniques, such as software LUN masking, online 
resynchronization, and disaster tolerance, have been introduced in another 
paper [11]. 

Device’s state          Description 
DEVICE_OK               Disk is ok. 
DEVICE_DEFUNCT          Disk is defunct. If the disk is mirrored, it means that both local and remote disks are defunct. 
DEVICE_MIRRORED         Disk is mirrored; both local and remote disks are ok. 
DEVICE_LOCAL_DEFUNCT    Disk is mirrored. Remote disk is ok, but local disk is defunct. 
DEVICE_MIRROR_DEFUNCT   Disk is mirrored. Local disk is ok, but remote disk is defunct. 
DEVICE_SYNCING          Disk is synchronizing. 

Table 1. The status of the local disk 
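The failover cases and the status identifier can be summarized as a small state transition sketch. The state names come from Table 1; the functions `after_write` and `io_target`, the treatment of states other than DEVICE_MIRRORED, and the routing of I/O while syncing are assumptions made for illustration, since the text does not specify them.

```python
from enum import Enum, auto

class DeviceState(Enum):
    DEVICE_OK = auto()
    DEVICE_DEFUNCT = auto()
    DEVICE_MIRRORED = auto()
    DEVICE_LOCAL_DEFUNCT = auto()
    DEVICE_MIRROR_DEFUNCT = auto()
    DEVICE_SYNCING = auto()

def after_write(state, local_ok, remote_ok):
    """Next state of a mirrored disk after a pair of write commands completes."""
    if state is not DeviceState.DEVICE_MIRRORED:
        return state  # assumption: other states are handled elsewhere
    if local_ok and remote_ok:
        return DeviceState.DEVICE_MIRRORED
    if local_ok:
        return DeviceState.DEVICE_MIRROR_DEFUNCT   # remote failed: break the mirror
    if remote_ok:
        return DeviceState.DEVICE_LOCAL_DEFUNCT    # local failed: redirect to remote
    return DeviceState.DEVICE_DEFUNCT

def io_target(state):
    """Which disk should serve host reads and writes in a given state."""
    if state is DeviceState.DEVICE_LOCAL_DEFUNCT:
        return "remote"
    if state is DeviceState.DEVICE_DEFUNCT:
        return None
    return "local"  # assumption: the local disk also serves I/O while syncing

state = after_write(DeviceState.DEVICE_MIRRORED, local_ok=False, remote_ok=True)
print(state.name, io_target(state))
```

A failed local read would lead to the same DEVICE_LOCAL_DEFUNCT state and the same redirection to the remote mirrored disk.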

4 Performance Evaluation 

The synchronous remote mirroring system was tested, and its performance was 
evaluated and analyzed. Because the read command is executed locally in the 
process of mirroring, our testing only used the write command. Table 2 shows 
the test configuration of the host and storage nodes. 
Host A 
  CPU               Intel Xeon 700 MHz x 4 
  Memory            1 GB 
  OS                Linux (kernel: 2.4.18) 
  FC HBA            Emulex LP982 (2 Gb/s) 

Host B 
  CPU               Intel Itanium 2 1 GHz x 2 
  Memory            2 GB 
  OS                Linux (kernel: 2.4.18-el2) 
  FC HBA            Emulex LP982 (2 Gb/s) 

Storage node and its storage subsystem 
  CPU               Intel Xeon 2.4 GHz x 1 
  Memory            1 GB 
  OS                Linux (kernel: 2.4.18) 
  FC HBA            Emulex LP982 (Initiator mode, 2 Gb/s) 
                    Qlogic ISP 2300 (Target mode, 2 Gb/s) 
  RAID Controller   Adaptec Ultra160 RAID Controller 2110S 
  SCSI Disks        Seagate Cheetah (73 GB, 10K RPM) x 7 configured as JBOD 

Table 2. Test configuration of hosts and storage node 



4.1 Comparison of Average Command Response Times 

The average command response time is a very important factor in evaluating 
performance and quality of service. In this test, a host issues commands with 
different data block sizes to its ‘network’ disk, which is provided by the local 
storage node. The goal is to compare the average response time of each command 
with and without mirroring. The Iometer [12] benchmarking kit was used. 
The host issues sequential write commands with block sizes ranging from 
64 KB to 2048 KB. This test uses one host configured as host A (given above) 
and two storage nodes: a local storage node and a remote storage node. Table 
3 shows the test results. The results show that synchronous mirroring has little 
impact on the average command response time for different block sizes. 



              Average Command Response Time 
Block Size    No mirror (ms)    Mirror (ms) 
64KB          1.718             1.795 
128KB         3.482             3.573 
192KB         5.335             5.357 
256KB         7.083             7.202 
512KB         13.573            14.469 
768KB         21.214            21.561 
1024KB        24.797            28.610 
1536KB        42.245            44.048 
2048KB        58.141            58.323 

Table 3. Comparison of average command response time on different block sizes 
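To make the size of the mirroring overhead in Table 3 concrete, the relative increase in response time can be computed per block size. The numbers below are copied from Table 3; the helper name `overhead_percent` is hypothetical, and the script only restates the table in relative terms.

```python
# (block size in KB, no-mirror ms, mirror ms), copied from Table 3
TABLE3 = [
    (64, 1.718, 1.795), (128, 3.482, 3.573), (192, 5.335, 5.357),
    (256, 7.083, 7.202), (512, 13.573, 14.469), (768, 21.214, 21.561),
    (1024, 24.797, 28.610), (1536, 42.245, 44.048), (2048, 58.141, 58.323),
]

def overhead_percent(no_mirror_ms, mirror_ms):
    """Relative increase in response time caused by mirroring, in percent."""
    return 100.0 * (mirror_ms - no_mirror_ms) / no_mirror_ms

overheads = {size: overhead_percent(a, b) for size, a, b in TABLE3}
for size, pct in overheads.items():
    print(f"{size:>5} KB: {pct:5.2f}% slower with mirroring")
```

The overhead stays below 7% for every block size except 1024 KB, where the measured difference is noticeably larger than for the neighboring sizes.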
4.2 The Total Execution Time for Replicating a Large Amount of Data 

In this test, we used the dd command on the host to replicate a large amount of 
data (10 GB-50 GB) to the network disk. By comparing the total execution time 
with and without mirroring, we evaluated the performance of synchronous 
mirroring. The dd command can directly generate block-level read/write 
requests. This test uses a host configured as host A and two storage nodes. 
Figure 5 shows the test results. The results show that synchronous mirroring 
has little impact on the total execution time for replicating a large amount of 
data. 




Fig. 5. The total execution time for replication of a large amount of data 



4.3 The Storage Node Bandwidth with Heavy Loads 

In this test, we used seven hosts and a local and a remote storage node, each 
with 7 SCSI disks, which are mapped to the hosts. Each host accesses its own 
network disk. We ran the IOzone [13] benchmarking kit on each host to place a 
heavy load on the local storage node. The test used sequential 100% write 
requests with record sizes ranging from 4 KB to 4096 KB. The hosts’ file 
systems were ext2, and the test file size was 15 GB. We compared the bandwidth 
of the local storage node with and without mirroring. The seven hosts were 
configured as host B. Figure 6 shows the bandwidth of the storage node for 
different record sizes. Because of the host cache, the results appear higher 
than the actual bandwidth of the storage node. In the figure, the black bars 
represent the results without mirroring, with an average bandwidth of 
144.314 MB/s, while the white bars represent the results with mirroring, with 
an average bandwidth of 142.399 MB/s. The bandwidth of the storage node with 
mirroring is 98.67% of that without mirroring, so the performance loss is very 
small. 

Fig. 6. A comparison of storage node bandwidth with heavy loads 

In addition, the SCSI HBA used in the test was an Ultra160 card, so the 
maximum theoretical throughput was only 160 MB/s. Furthermore, the storage 
node is a server responsible only for I/O operations and runs no other 
services, so its CPU is not heavily loaded. In fact, its CPU utilization is 
below 20% most of the time. 

5 Conclusion 

In this study, we designed and implemented a storage-based synchronous remote 
mirroring system for the TH-MSNS. Based on the high bandwidth and long- 
distance linking ability of dedicated fiber connections, this approach provides a 
consistent and up-to-date data copy in a remote location to meet the demands 
of disaster recovery. This system is independent of the actual storage unit, and 
the process of mirroring is transparent to the hosts. In addition, we present a 
disk failover solution. In the performance evaluation, we compared the average 
command response time with different block sizes, the total execution time for 
replicating a large amount of data, and the bandwidth of the storage node under 
heavy loads, both with and without synchronous mirroring. The performance 
results indicate that the bandwidth of storage node with mirroring under heavy 
loads is 98.67% of the bandwidth of storage node without performing mirroring, 
a result showing only slight performance loss. This means that our synchronous 
remote mirroring has little impact on the host’s average response time and the 
actual bandwidth of the storage node. 
Abstract. IP storage is becoming more commonplace with the prevalence 
of the iSCSI (Internet SCSI) protocol that enables the SCSI protocol to 
run over the existing IP network. Meanwhile, storage QoS that assures a 
required storage service for each storage client has gained in importance 
with increased opportunities for multiple storage clients to share the 
same IP storage. Considering the existence of other competing network 
traffic in IP network, we have to provide storage I/O traffic with guar- 
anteed network bandwidth. Most importantly, we need to calculate the 
required network bandwidth to assure a given storage QoS requirement 
between a storage client and IP storage. This paper proposes a network 
bandwidth computation technique that not only accounts for the over- 
head caused by the underlying network protocols, but also guarantees the 
minimum data transfer delay over the IP network. Performance evalua- 
tions with various I/O workload patterns on our IP storage testbed verify 
the correctness of the proposed technique; that is, allocating a part (0.6- 
20%) of the entire network bandwidth can assure the given storage QoS 
requirements. 



1 Introduction 

Storage Area Networks (SAN), such as Fiber Channel and Gigabit Ethernet, 
have enabled a plethora of storage systems to be maintained as a storage pool, 
resulting in reduced total cost of ownership, effective storage resource man- 
agement, etc. Such SAN-based storage systems are advantageous in terms of 
scalability and configurability, compared with SCSI bus-based storage systems. 
Accordingly, a few data transmission protocols have newly emerged to support 
the SAN environment. Fiber Channel Protocol (FCP) was developed for FC-based 
SAN, and iSCSI, recently ratified by the Internet Engineering Task Force, was 
designed for IP SAN. A main advantage of iSCSI is that it can operate on 
standard network components, such as Ethernet [1]; that is, it exploits existing 
features and tools that have been developed for the IP network. Thus, this 
paper focuses on the storage environment using IP-based SAN (IP storage), where 
storage 
devices are attached to IP networks, and storage clients communicate with the 
storage devices via the iSCSI protocol [1]. An initiator-mode iSCSI protocol runs 
on the storage client, whereas a target-mode iSCSI protocol operates on the IP 
storage. Note that the traditional SCSI protocol operates on top of the iSCSI 
protocol layer that transmits a given SCSI command to its associated IP storage 
over the IP network. 

With advances in storage technologies in terms of storage space and I/O
performance, it becomes more likely that multiple storage clients share the same
storage. Different storage clients may require different storage services, called
storage Quality of Service (QoS); that is, each storage client requires
a guaranteed storage service, independent of the status of the I/O services
of other storage clients. Unfortunately, the storage itself does not contain any
feature for assuring storage QoS. As a result, recent research efforts [2,3] try to
add the QoS feature to various types of storage systems. However, notice that
the previous research emphasizes the QoS issue only within the storage system,
whereas it is assumed that the SAN itself has no QoS issues. Note that an FC-based
SAN is used only for storage I/O traffic. IP storage, however, transmits
its data over the IP network, where storage I/O traffic is likely to coexist with
the other network traffic. This situation leads us to reserve an
amount of network bandwidth for the storage I/O traffic between a storage
client and its associated IP storage to avoid interference with the
other network traffic. A naive approach is to allocate the full network bandwidth
(or a separate dedicated IP network) that is large enough to serve the storage I/O
traffic with a QoS guarantee for each pair of a storage client and its associated IP
storage. However, this approach under-utilizes IP network resources,
even though it can certainly guarantee a given storage QoS requirement.
By contrast, unless enough network bandwidth is available
between the storage client and the IP storage, the storage QoS
requirement can no longer be guaranteed: I/O throughput (I/O requests
per second) drops and response time rises due to increased data transfer delays.

This paper emphasizes the problem of computing the required network band- 
width to meet a given storage QoS requirement. It proposes a network bandwidth 
computation technique that not only accounts for overhead caused by the under- 
lying network protocols, but also guarantees the minimum data transfer delay 
over the IP network. For FC-based SAN environments, Ward et al.
[4] proposed a scheme to automatically design an FC-based SAN that not
only serves a given set of storage I/O traffic between storage clients and storage 
systems, but also minimizes the system cost. 



2 Problem Description 

We begin by defining the notation used throughout the paper. $C_i$ and
$Q_i$ represent storage client $i$ and its storage QoS requirement, respectively.
Generally, the storage QoS requirement is defined as $Q_i = \{f_i, iops_i, sz_i, s_i, rt_i\}$.
Here $f_i$ is the ratio of read I/O requests, and $iops_i$,
$sz_i$, and $rt_i$ are the number of I/O requests per second (IOPS),
the average I/O request size, and the average response time requested by $C_i$,
respectively. The I/O access pattern is random with $s_i = 0$ and purely sequential with
$s_i = 1$. Note that our network bandwidth computation does not depend on the
value of $s_i$. We denote by $b_i^{cs}$ and $b_i^{sc}$ the network bandwidth allocated for
the direction from $C_i$ to its associated IP storage and for the opposite direction,
respectively. The problem that this paper solves can then be described as
follows: compute $b_i^{cs}$ and $b_i^{sc}$ that satisfy the given $Q_i$ for the storage client
$C_i$ and its associated IP storage. We assume that the storage resources
other than network resources (bandwidth) have been appropriately reserved to
satisfy the storage QoS requirement $Q_i$.



3 The Proposed Technique 

To begin, we explain the protocol layering for the iSCSI protocol and the specific
behavior of the iSCSI protocol for read and write I/O requests. The iSCSI
protocol data unit (PDU) consists of a 48-byte iSCSI header and iSCSI data
of variable length. The maximum iSCSI data length depends on the type of
underlying Ethernet card. Typically, it ranges from 1,394 to 8,894 bytes.
The TCP/IP headers and the Ethernet header occupy 40 bytes and 18
bytes, respectively.

Figure 1 presents the protocol behaviors for read and write I/O requests. 
In the case of the read I/O request, as shown in Figure 1(a), the storage client 
sends the READ SCSI command to the IP storage. Next, after reading the re- 
quested data from its internal disk drives, the IP storage transmits the data 
to the storage client in the DATA_IN phase. Note that the data are to be fragmented
into smaller pieces according to the maximum iSCSI data length. Finally,
the IP storage sends the response message to the storage client.

Fig. 1. iSCSI protocol behaviors for read and write I/O requests: (a) read and
(b) write

The iSCSI protocol behavior for the write I/O request is more complicated than that of
the read I/O request because of free buffer management and performance opti- 
mization techniques like immediate/unsolicited data transmission. The storage 
client sends the WRITE SCSI command to the IP storage. Unless the data size
is greater than the maximum iSCSI data length, the data is transmitted along with
the WRITE SCSI command. This collapses the COMMAND phase and the
DATA_OUT phase [1] and is called immediate data transmission [1]. If the data
size is greater than FirstBurstLength, the storage client transfers
the first FirstBurstLength bytes to the IP storage without receiving the
Ready to Transfer (R2T) message, which the IP storage uses to secure
free buffer space for the write data. This process is called unsolicited data
transmission [1]. Note that immediate data transmission is combined with
unsolicited data transmission. Afterwards, the storage client transfers data
only when it receives an R2T message from the IP storage. This is called solicited
data transmission [1].

In what follows, we derive a set of equations to compute the required network 
bandwidth to meet a given storage QoS requirement. We start by computing the 
amount of data transfer, including the underlying protocol overhead for given 
read and write I/O requests of size $sz_i$. We denote with $D_r^{cs}(sz_i)$ the amount of
data transfer from the storage client to the IP storage for the read I/O request
of size $sz_i$, and $D_r^{sc}(sz_i)$ for the opposite direction. From the protocol behavior
for the read I/O request, as shown in Figure 1(a), we can easily obtain $D_r^{cs}(sz_i)$
and $D_r^{sc}(sz_i)$ as follows:

$$D_r^{cs}(sz_i) = ov_{prot}, \tag{1}$$

$$D_r^{sc}(sz_i) = 2\,ov_{prot} + \left\lfloor \frac{sz_i}{S'_f} \right\rfloor S_f + sz_i \bmod S'_f, \tag{2}$$

where $S'_f = S_f - ov_{prot}$. The notation $ov_{prot}$ represents the 106-byte protocol
overhead caused by the Ethernet header, the TCP/IP header, and the iSCSI
header. The notation $S_f$ represents the underlying Ethernet frame size.
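
Equations (1) and (2) can be checked numerically with a minimal sketch such as the one below. The function name `read_transfer_bytes` and the 1500-byte default frame size are illustrative assumptions, not part of the original technique; the header sizes are the ones quoted in the text.

```python
# Header sizes quoted in the text (bytes).
ETHERNET_HEADER = 18
TCP_IP_HEADERS = 40
ISCSI_HEADER = 48
OV_PROT = ETHERNET_HEADER + TCP_IP_HEADERS + ISCSI_HEADER  # 106 bytes


def read_transfer_bytes(sz, frame_size=1500):
    """Return (client-to-storage, storage-to-client) bytes for one read of sz bytes.

    frame_size plays the role of S_f; the 1500-byte default is an assumption.
    """
    payload = frame_size - OV_PROT   # S'_f: data bytes carried by a full frame
    to_storage = OV_PROT             # Eq. (1): only the READ command is sent
    full_frames, remainder = divmod(sz, payload)
    # Eq. (2): full frames, the leftover data, and two protocol overheads
    # (one for the last data PDU, one for the response message).
    to_client = 2 * OV_PROT + full_frames * frame_size + remainder
    return to_storage, to_client


if __name__ == "__main__":
    print(read_transfer_bytes(1024))       # small read
    print(read_transfer_bytes(64 * 1024))  # large read
```

With these assumptions, a 1 KB read costs 1,236 bytes toward the client, while a 64 KB read costs 70,730 bytes, so the relative overhead is much larger for small requests.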

Next, we calculate the amount of data transfer for the write I/O request. To
begin, the storage client sends the agreed-upon amount of data (unsolicited data
transmission) to the IP storage without waiting for an R2T message. We assume that
the behavior of each data transfer follows that of solicited data transmission
because the network traffic under our consideration is heavy enough. However,
if the I/O request size is not greater than the maximum iSCSI data length, the
associated data is delivered to the IP storage by using immediate data transmission.
As with the read I/O request, we denote with $D_w^{cs}(sz_i)$ the amount
of data from the storage client to the IP storage for the write I/O request, and
$D_w^{sc}(sz_i)$ for the opposite direction. Based on the iSCSI protocol behavior for
the write I/O request, as shown in Figure 1(b), $D_w^{cs}(sz_i)$ and $D_w^{sc}(sz_i)$ can
be obtained as follows:






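$$D_w^{cs}(sz_i) = \begin{cases} ov_{prot} + sz_i & \text{if } sz_i \le S'_f \\ ov_{prot} + \left\lfloor \frac{sz_i}{S'_f} \right\rfloor S_f + sz_i \bmod S'_f & \text{otherwise,} \end{cases} \tag{3}$$

$$D_w^{sc}(sz_i) = \begin{cases} ov_{prot} & \text{if } sz_i \le FirstBurstLength \\ 2\,ov_{prot} & \text{otherwise.} \end{cases} \tag{4}$$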



If the data size is not greater than the maximum iSCSI data length $S'_f$,
$D_w^{cs}(sz_i)$ is $ov_{prot} + sz_i$. Otherwise, it includes $\lfloor sz_i / S'_f \rfloor$ times the maximum
Ethernet frame size $S_f$ and the remaining data with the protocol overhead
$ov_{prot}$. As for $D_w^{sc}(sz_i)$, if the data size is not greater than FirstBurstLength,
$D_w^{sc}(sz_i)$ is equal to the size of the iSCSI response message. Otherwise, it
becomes $2\,ov_{prot}$, because an R2T message is also delivered to the
storage client in addition to the iSCSI response message.
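
The write-side case analysis can be sketched in the same way. The function name `write_transfer_bytes` is hypothetical, and the defaults (a 1500-byte frame and a `first_burst_length` of 65,536 bytes) are assumptions chosen for illustration; the text itself does not state the FirstBurstLength used.

```python
OV_PROT = 106  # Ethernet (18) + TCP/IP (40) + iSCSI (48) header bytes


def write_transfer_bytes(sz, frame_size=1500, first_burst_length=65536):
    """Return (client-to-storage, storage-to-client) bytes for one write of sz bytes.

    frame_size stands for S_f; both defaults are assumptions for illustration.
    """
    payload = frame_size - OV_PROT  # S'_f: maximum iSCSI data length
    if sz <= payload:
        # Immediate data: the data rides along with the WRITE command.
        to_storage = OV_PROT + sz
    else:
        full_frames, remainder = divmod(sz, payload)
        to_storage = OV_PROT + full_frames * frame_size + remainder
    # Response message only, or response plus R2T for larger writes.
    to_client = OV_PROT if sz <= first_burst_length else 2 * OV_PROT
    return to_storage, to_client


if __name__ == "__main__":
    print(write_transfer_bytes(1024))
    print(write_transfer_bytes(64 * 1024))
```

Under these assumptions a 1 KB write sends 1,130 bytes toward the storage and 106 bytes back, which illustrates why the two directions need very different allocations.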

Next, we denote with $\bar{b}_i^{cs}$ the average amount of data transferred per second, including the
underlying protocol overhead, from the storage client to the IP storage for
$sz_i$-sized I/O requests, and with $\bar{b}_i^{sc}$ that for the opposite direction.
From Equations (1)-(4), we can obtain $\bar{b}_i^{cs}$ and $\bar{b}_i^{sc}$ as follows:

$$\bar{b}_i^{cs} = \{f_i D_r^{cs}(sz_i) + (1 - f_i) D_w^{cs}(sz_i)\}\, iops_i, \tag{5}$$

$$\bar{b}_i^{sc} = \{f_i D_r^{sc}(sz_i) + (1 - f_i) D_w^{sc}(sz_i)\}\, iops_i. \tag{6}$$

The network bandwidth allocation with $\bar{b}_i^{cs}$ and $\bar{b}_i^{sc}$ is expected to assure the
requested maximum storage bandwidth, obtained by multiplying $sz_i$ and $iops_i$.
However, it may not guarantee the demanded response time $rt_i$. For example,
notice that Equations (5) and (6) yield a lower network bandwidth for
a smaller $sz_i \cdot iops_i$. This implies that each I/O request
of size $sz_i$ is more likely to experience a longer transmission delay on the IP network, which is
likely to entail a violation of the demanded response time.

Thus, we introduce the minimum network bandwidth to assure the demanded
response time $rt_i$. We denote with $m_i^{cs}$ and $m_i^{sc}$ the minimum network
bandwidth for each direction. The values of $m_i^{cs}$ and $m_i^{sc}$ are determined
such that the transmission delay of each I/O request is not greater than $\alpha_i \cdot rt_i$,
where $0 < \alpha_i < 1$. Usually, $\alpha_i$ is determined according to the marginal response
time to meet $rt_i$ in the phase of designing the associated IP storage [2,5,3]. For
example, if the IP storage is designed to assure 15 msec for a given $rt_i$ = 20 msec, the
value of $\alpha_i$ can range from 0 to 0.25. We compute $m_i^{cs}$ and $m_i^{sc}$ from the
simple relationship that the expected transmission delay is inversely proportional
to the allocated network bandwidth, without accounting for the effects of traffic
congestion control, IP routing, the transmission buffer size, TCP retransmission,
etc. The $m_i^{cs}$ and $m_i^{sc}$ are written as follows:

$$m_i^{cs} = \frac{\max\{\lceil f_i \rceil D_r^{cs}(sz_i),\ \lceil 1 - f_i \rceil D_w^{cs}(sz_i)\}}{\alpha_i\, rt_i}, \tag{7}$$

$$m_i^{sc} = \frac{\max\{\lceil f_i \rceil D_r^{sc}(sz_i),\ \lceil 1 - f_i \rceil D_w^{sc}(sz_i)\}}{\alpha_i\, rt_i}. \tag{8}$$



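By denoting with $b_i^{cs}$ and $b_i^{sc}$ the network bandwidth for each direction required
to guarantee a given $Q_i$, and writing $\bar{b}_i^{cs}$ and $\bar{b}_i^{sc}$ for the average
bandwidths of Equations (5)-(6), we finally have $b_i^{cs}$ and $b_i^{sc}$ as follows:

$$b_i^{cs} = \max\{\bar{b}_i^{cs},\ m_i^{cs}\}, \tag{9}$$

$$b_i^{sc} = \max\{\bar{b}_i^{sc},\ m_i^{sc}\}. \tag{10}$$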
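
As a rough illustration of how Equations (5)-(10) combine for one direction, the sketch below takes the per-request byte counts as inputs. The function name `required_bandwidth` is hypothetical, and the value alpha = 0.2 used in the example is an assumption; the text only bounds alpha between 0 and 1.

```python
import math


def required_bandwidth(f, iops, rt, alpha, d_read, d_write):
    """Bandwidth in bytes/s for one direction.

    f        read ratio in [0, 1]
    iops     demanded I/O requests per second
    rt       demanded response time in seconds
    alpha    fraction of rt allowed for network transfer (assumed input)
    d_read   bytes moved in this direction by one read request
    d_write  bytes moved in this direction by one write request
    """
    # Eq. (5)/(6): average traffic including protocol overhead.
    average = (f * d_read + (1 - f) * d_write) * iops
    # Eq. (7)/(8): the largest request type that actually occurs must be
    # transferable within alpha * rt.
    largest = max(math.ceil(f) * d_read, math.ceil(1 - f) * d_write)
    minimum = largest / (alpha * rt)
    # Eq. (9)/(10): allocate whichever bound is larger.
    return max(average, minimum)


if __name__ == "__main__":
    # Read-only 1 KB workload, toward the client: 1,236 bytes per read.
    b = required_bandwidth(f=1.0, iops=82, rt=0.010, alpha=0.2,
                           d_read=1236, d_write=106)
    print(f"{b / 2**20:.2f} MB/s")
```

In this example the response-time bound (about 618,000 bytes/s) dominates the average traffic (about 101,000 bytes/s), which is the situation the minimum bandwidth term is meant to handle.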

4 Performance Evaluations

We set up an experimental testbed for IP storage to evaluate the performance 
of the proposed technique. We use two Intel Pentium III based desktops for the 
storage client and the IP storage. Both systems are attached to a Gigabit IP 
network via Gigabit Ethernet cards and a switch. Assume that no other traffic 
exists between the two systems. The maximum size of the Ethernet frame is 
1500 bytes. Linux kernel 2.4.18 runs on both the storage client and the IP
storage. The storage client includes the initiator-mode iSCSI driver developed
by the University of New Hampshire [6], and the IP storage contains the
target-mode iSCSI driver for operating the iSCSI protocol. The network bandwidth
is controlled by Token Bucket Filtering (TBF) [7]. We believe that this type 
of end-to-end traffic control works because no other traffic exists between the 
storage client and the IP storage. 

Table 1 shows the measured $iops_i$ and $rt_i$ for the QoS requirements
$Q_1$-$Q_4$ when the amount of network bandwidth is computed by the proposed
technique. In addition, they are compared with the results obtained with the full network
bandwidth for the same QoS requirements. We denote the proposed technique and
the full network bandwidth with prot-rt and full, respectively.

Table 1. Results of $iops_i$ and $rt_i$ for the various QoS requirements $Q_1$-$Q_4$
by the proposed technique (prot-rt) and the full network bandwidth allocation
(full)

QoS  Technique  Toward IP storage  Toward client  iops_i     rt_i
Q1   prot-rt    0.05MB/s           0.60MB/s       84(100%)   9.74ms
     full       100.00MB/s         100MB/s        82(100%)   9.70ms
Q2   prot-rt    1.84MB/s           0.17MB/s       265(99%)   2.59ms
     full       100.00MB/s         100MB/s        266(100%)  2.58ms
Q3   prot-rt    0.03MB/s           19.19MB/s      77(100%)   15.66ms
     full       100.00MB/s         100MB/s        76(100%)   15.34ms
Q4   prot-rt    15.67MB/s          0.02MB/s       211(100%)  20.28ms
     full       100.00MB/s         100MB/s        208(100%)  21.54ms

Each of the storage QoS requirements is represented as follows:
$Q_1 = \{f_1 = 1, iops_1 = 82, sz_1 = 1\,\text{KB}, s_1 = 0, rt_1 = 10\,\text{ms}\}$,
$Q_2 = \{f_2 = 0, iops_2 = 266, sz_2 = 1\,\text{KB}, s_2 = 0, rt_2 = 3\,\text{ms}\}$,
$Q_3 = \{f_3 = 1, iops_3 = 76, sz_3 = 64\,\text{KB}, s_3 = 0, rt_3 = 18\,\text{ms}\}$, and
$Q_4 = \{f_4 = 0, iops_4 = 208, sz_4 = 64\,\text{KB}, s_4 = 0, rt_4 = 22\,\text{ms}\}$. The
first two requirements are for the small read and write I/O requests as in OLTP 
applications, and the others are for the large read and write I/O requests in sci- 
entific applications [5] . Since the network bandwidth allocation is independent of 
storage access patterns, we assume that all the storage access patterns are purely 
random. In addition, the demanded IOPS $iops_i$ and its associated response
time $rt_i$ are configured by injecting a set of I/O workload patterns and measuring
each performance. Note that the $iops_i$ and $rt_i$ of the QoS requirements
correspond to traffic that is neither so light that the IP storage is mostly idle nor so
heavy that it overloads the system. The results reveal that allocating
only 0.6-20% of the full network bandwidth, as computed by the proposed technique,
can meet the given storage QoS requirements. Observe that the proposed
technique can compute an appropriate amount of network bandwidth to provide
the same quality of storage service as when the full network bandwidth
is allocated. The percentage values in the $iops_i$ column represent the percentage
of the measured IOPS with respect to the demanded IOPS.

Table 2 shows the measured $iops_i$ and $rt_i$ for the QoS requirements $Q_1$-$Q_4$
under the prot and naive techniques. The prot technique computes the required
network bandwidth only by considering the underlying protocol overhead, as
shown in Equations (5)-(6); that is, $\bar{b}_i^{cs}$ and $\bar{b}_i^{sc}$. The naive technique simply
calculates the network bandwidth by multiplying $iops_i$ and $sz_i$. It assigns the
same bandwidth for each direction. As expected, the naive technique cannot
guarantee even the demanded $iops_i$, because it does not account for the underlying
protocol overhead in the Ethernet, TCP/IP, and iSCSI layers at all.

Table 2. Results of $iops_i$ and $rt_i$ for the various QoS requirements $Q_1$-$Q_4$ by
the technique using Equations (5)-(6) (prot) and the technique using $iops_i \cdot sz_i$
(naive)

QoS  Technique  Toward IP storage  Toward client  iops_i    rt_i
Q1   naive      0.06MB/s           0.06MB/s       64(77%)   28.82ms
     prot       0.01MB/s           0.12MB/s       80(96%)   12.99ms
Q2   naive      0.26MB/s           0.26MB/s       238(89%)  9.25ms
     prot       0.29MB/s           0.03MB/s       262(98%)  3.54ms
Q3   naive      4.78MB/s           4.78MB/s       69(91%)   22.31ms
     prot       0.01MB/s           5.19MB/s       74(96%)   19.33ms
Q4   naive      13.06MB/s          13.06MB/s      194(93%)  25.13ms
     prot       14.08MB/s          0.02MB/s       208(99%)  22.05ms

Notice that the smaller sized I/O request causes higher protocol overhead, as expected from
Equations (1)-(4). In addition, the results show that read I/O requests create
more protocol overhead than write I/O requests, as can be expected mainly from
Equation (2)-(3). By contrast, the prot technique guarantees more than 96% 
of the required iopSi. However, it does not satisfy the demanded response time 
of rti because of the transmission delay on IP network. Recall that, as shown in 
Table 1, the proposed prot-rt technique can meet both the iopSi and the rti 
by effectively allotting the network bandwidth for each direction between the 
storage client and the IP storage. 







Y.J. Nam et al. 



5 Conclusion and Future Work 

This paper addressed the problem of effectively allocating network bandwidth 
to assure a given QoS requirement for IP storage. It defined a specification of 
the storage QoS requirement, and it proposed a technique to compute the de- 
manded network bandwidth to meet the storage QoS requirement that not only 
accounts for the overhead caused by the underlying network protocols, but also 
guarantees the minimum data transfer delay over the IP network. Performance 
evaluations with various I/O workload patterns on our IP storage testbed verified 
the correctness of the proposed technique; that is, allocating a part (0.6-20%) 
of the entire network bandwidth can assure the given storage QoS requirements. 
We are currently revising Equations (9)-(10) to additionally account for
real-world I/O workload patterns characterized by self-similarity and for the condition
of traffic congestion.

Abstract. In conventional DHTs, each node is assigned an exclusive
slice of the identifier space. Simple as it is, such an arrangement may be too coarse. In
this paper we propose a generic component structure: several independent
nodes constitute a cell; a slice of the identifier space is held jointly, in
condominium, by the cell's nodes; some of the nodes in a cell cooperatively and
transparently shield the internal dynamism and structure of the cell from
outsiders; and this type of structure can be recursively repeated. Cells act
like raw nodes in conventional DHTs, and cell components can be used
as bricks to construct any DHT-like system. This approach provides encapsulation,
scalable hierarchy, and enhanced security with little added
complexity.



1 Introduction 

Many Distributed Hash Tables (DHTs) ([1], [2], [3], [4], [5], [6]) have been pro- 
posed in recent years. A distinguishing feature provided by these systems is 
scalable routing performance while keeping a scalable routing table size in each 
node. In these algorithms, every node has a unique numerical identifier, and
the identifier space is divided among the live participating nodes, each of which is solely
responsible for its assigned exclusive slice, or zone, of the identifier space. There is
no central node, and every node is identical and visible to all other nodes. This
approach, which discards role differences, is simple but may be rigid in practice.

In the previous decade, the success of object-oriented programming revealed the
power of encapsulation: objects encapsulate internal implementations and states;
they supersede discrete functions and variables, and communicate through exported
public properties and methods. Inspired by this, we present the Paramecium
architecture: raw nodes aggregate into composite cells, which communicate
with each other through exported constituent nodes. The extension is simple but
promising in two traits: encapsulation and shared zones.

2 Design of Paramecium

Paramecium’s goal is to shield the highly dynamic behavior of unstable nodes in
the system and to improve security in the open P2P environment. To achieve this goal,
Paramecium introduces encapsulation and condominium through the cell structure.
In this section, we describe Paramecium’s cell structure and its necessary exported
properties and functions. To be generic, we depict only an abstract implementation
of Paramecium and leave implementation-specific issues to concrete implementations.
For conciseness, we emphasize the differences between Paramecium and
conventional DHTs.



2.1 Cell Structure 

Atom Cell Adjacent independent nodes in the identifier space constitute an atom
cell, which is identified by an exclusive slice of the identifier space, named the cell zone.
The set of all cell zones covers the whole identifier space. Every node resides
in exactly one atom cell, and a cell may consist of a single node.
No node exists outside a cell. Nodes in the same atom cell are called
sibling nodes.

Sibling nodes can be organized into a flat structure (e.g., full connection or
a DHT) or a hierarchical structure (e.g., a spanning tree). Paramecium does not
specify any particular internal structure, including the division of the cell zone among sibling
nodes and the maintenance mechanism within a cell.

Nodes in a cell are categorized into two role types: boundary nodes and hidden
nodes. As representatives of their resident cells, boundary nodes are responsible
for handling requests from nodes in different cells. Hidden nodes help sibling boundary
nodes to perform the exported functions of the resident cell, but they do not serve as
representatives. Except for one type of request, hidden nodes reject any
request from outsiders. Hidden nodes can directly request services from non-sibling
boundary nodes, or ask sibling boundary nodes to relay requests. Every node,
whether boundary node or hidden node, provides a service called giveMeRepresentatives.
The semantics of giveMeRepresentatives is self-explanatory:
when a node X receives a request of this type, X responds
with the representatives of its resident cell. Role selection depends on each node’s
discretion and on the implementation of giveMeRepresentatives. The latter is
implementation-specific and outside the scope of Paramecium. Drawing an analogy between
Paramecium and a class, we can imagine that boundary nodes are the
cell’s public methods while hidden nodes are its protected or private methods.
Boundary nodes encapsulate the implementation of their resident cells and export the
corresponding properties and functions. Hidden nodes’ rejection of outside requests
enforces the rule of encapsulation. We suggest that representatives free hidden
nodes from inter-cell business, allowing them to concentrate on internal
affairs. To be concise, we use cell and the cell’s boundary nodes interchangeably in the
rest of this paper where no confusion arises.
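
The role rules above can be illustrated with a minimal sketch. The class names `Cell` and `Node`, the `handle` method, and the use of `PermissionError` for rejection are all hypothetical choices for illustration; Paramecium itself leaves these details to the concrete implementation.

```python
class Node:
    """A raw node that is either a boundary node or a hidden node."""

    def __init__(self, name, boundary, cell):
        self.name = name
        self.boundary = boundary
        self.cell = cell

    def handle(self, sender, kind):
        # Every node, hidden or not, answers giveMeRepresentatives.
        if kind == "giveMeRepresentatives":
            return [n.name for n in self.cell.representatives()]
        # Hidden nodes reject every other request from outside their cell.
        outsider = sender.cell is not self.cell
        if outsider and not self.boundary:
            raise PermissionError(f"{self.name} is hidden from other cells")
        return f"{self.name} handled {kind}"


class Cell:
    """An atom cell holding sibling nodes (hypothetical minimal model)."""

    def __init__(self, name):
        self.name = name
        self.nodes = []

    def add(self, node_name, boundary):
        node = Node(node_name, boundary, self)
        self.nodes.append(node)
        return node

    def representatives(self):
        return [n for n in self.nodes if n.boundary]


if __name__ == "__main__":
    a, b = Cell("A"), Cell("B")
    a1, a2 = a.add("a1", boundary=True), a.add("a2", boundary=False)
    b1 = b.add("b1", boundary=True)
    print(a2.handle(b1, "giveMeRepresentatives"))  # outsider learns whom to contact
    print(a1.handle(b1, "lookup"))                 # boundary node serves outsiders
```

In this sketch an outsider that reaches a hidden node can still discover the cell's boundary nodes, but cannot use the hidden node for anything else.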

Evolve to Organism As a natural extension, the cell structure can be recursively
repeated: cells conglomerate into a larger, higher-level cell (an organism).
The grammar expressed in BNF (Backus-Naur Form) is: cell ::= cell | {cell}. All
cells that immediately form a new cell X are called X's child cells. This is a hierarchical
architecture that abides by the principle of encapsulation. Child cells can
serve in two ways: as boundary cells or as hidden cells. Only boundary cells
are exported to the world outside their resident cell. The implementation,
maintenance, and internal dynamics of a cell are shielded by its boundary cells,
as in an atom cell. A higher-level cell has no knowledge of, and no interest in,
lower-level affairs. Considering the resulting benefit, the incurred complexity
should be justifiable.
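
One way to picture the recursive structure is a tree in which only boundary children are visible from outside. The sketch below is an assumption-laden illustration: it models a raw node as a leaf cell, and the names `Cell` and `exported_nodes` are hypothetical.

```python
class Cell:
    """A cell made of child cells; a cell without children stands for a raw node."""

    def __init__(self, name, children=(), boundary=False):
        self.name = name
        self.children = list(children)
        self.boundary = boundary  # role of this cell inside its parent


def exported_nodes(cell):
    """Names of raw nodes that a cell exposes to the outside world.

    Only boundary children are followed, so hidden subcells and everything
    beneath them stay invisible at every level.
    """
    if not cell.children:
        return [cell.name]
    visible = []
    for child in cell.children:
        if child.boundary:
            visible.extend(exported_nodes(child))
    return visible


if __name__ == "__main__":
    atom_a = Cell("A", [Cell("n1", boundary=True), Cell("n2")], boundary=True)
    atom_b = Cell("B", [Cell("n3", boundary=True)], boundary=False)
    organism = Cell("organism", [atom_a, atom_b])
    print(exported_nodes(organism))
```

Here node n3 is a boundary node of cell B, but because B is a hidden child of the organism, n3 is not exported at the organism level.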

2.2 Modification to Conventional DHTs 

Routing Table In addition to its own zone and the traditional routing table, called the
inter-cell routing table in Paramecium, a node must maintain state
composed of its resident cell zone and its intra-cell routing table. Each entry in
the intra-cell routing table contains the NodeID, zone, and other
implementation-related information of a sibling node or subcell. A node’s intra-cell routing table
can include some or all of its siblings. The connection topology in a cell and the
maintenance of the intra-cell routing table depend on the concrete implementation.

The inter-cell table can be constructed and maintained by Chord [2], Pastry [3],
or other DHT protocols. Although an entry in an inter-cell routing table points
to an appropriate top-layer cell, its content is a boundary node of that cell together with
the cell’s zone. The selection of boundary nodes through the giveMeRepresentatives
service is also implementation-specific, but all sibling
boundary nodes are eligible candidates. Sibling nodes/subcells can share an
identical inter-cell routing table. Thus, only some of the nodes have to actively maintain
inter-cell routing tables and disseminate the results to their siblings. This difference
distinguishes Paramecium from other DHTs.

Routing The amendment to conventional routing schemes is minor. A node
first checks whether the routing target falls into its resident cell’s zone. If so,
the node uses an implementation-specific approach, with the help of the intra-cell
routing table, to route the request (in the simplest scenario, where sibling
nodes are fully connected, the final target can be reached in one hop by local
lookup). Otherwise, the request is routed using the inter-cell routing table and the specific
inter-cell routing algorithm.
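
The two-step decision can be sketched as follows. This is a simplified illustration under several assumptions: zones are half-open integer intervals that do not wrap around, siblings are fully connected, and the inter-cell step is a plain table scan standing in for whatever DHT algorithm (for example Chord or Pastry) is actually used. The names `in_zone` and `route` are hypothetical.

```python
def in_zone(zone, key):
    """True if key lies in the half-open interval [low, high)."""
    low, high = zone
    return low <= key < high


def route(key, cell_zone, intra_table, inter_table):
    """Return ("intra", node) or ("inter", boundary_node) for the next hop.

    intra_table: list of (zone, sibling node) pairs for the resident cell
    inter_table: list of (cell zone, boundary node) pairs for other cells
    """
    if in_zone(cell_zone, key):
        # Target is inside the resident cell: one-hop local lookup.
        for zone, node in intra_table:
            if in_zone(zone, key):
                return ("intra", node)
        raise LookupError(f"no sibling owns key {key}")
    # Otherwise hand the request to a boundary node of another cell.
    for zone, boundary_node in inter_table:
        if in_zone(zone, key):
            return ("inter", boundary_node)
    raise LookupError(f"no known cell covers key {key}")


if __name__ == "__main__":
    intra = [((0, 50), "n1"), ((50, 100), "n2")]
    inter = [((100, 200), "b1"), ((200, 256), "c1")]
    print(route(70, (0, 100), intra, inter))
    print(route(150, (0, 100), intra, inter))
```

The only change relative to a conventional DHT lookup is the initial zone check; everything after the inter-cell branch would follow the chosen DHT protocol unchanged.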

Node join The join operation is intuitive. Suppose a node X wants to join an
existing Paramecium system. X first finds an atom cell C whose zone covers X,
using the routing algorithm described in the previous paragraph. X then sends a join
message to the other sibling nodes residing in C. Meanwhile, X learns the intra-cell
and inter-cell routing tables from them. If X acts only as a hidden node, there
are no perceivable changes for other cells. Otherwise, other cells will eventually
detect X's arrival during the periodic update of their inter-cell routing tables
through the giveMeRepresentatives service.

Node departure Ordinarily, a node can crash or leave the system without warning
the relevant nodes. As with a node’s join, the departure of a hidden node does not
affect other cells’ routing tables; only the intra-cell routing tables of the node’s
sibling nodes change. The effectiveness of encapsulation applies here again.

3 Related Work 

There are many DHTs, existing or under development. To the best of our knowledge,
Paramecium is the first general architecture that introduces the concepts of
encapsulation and of a zone jointly governed by a group of nodes.

There are some similarities between Paramecium and CAN [1], Chord [2],
Pastry [3], Tapestry [4], SkipNet [5], and Koorde [6]. In these conventional DHTs,
however, each node exclusively takes a portion of the identifier space and exposes itself
to others at the system level. Nodes can share partial routing tables, but they do not
support encapsulation either.

4 Conclusion 

Recognizing that raw nodes acting as bricks to build DHTs may not be flexible,
Paramecium extends the construction unit from a primitive node to a composite cell,
which is made of nodes/subcells in an intuitive and efficient way. The cell structure
shields its internal dynamism and structure through the distinguishing characteristic
of encapsulation: cells interact with each other only through boundary
nodes/subcells. Recursive cell composition provides a scalable analogue of hierarchical
structure. With the help of the zone jointly shared among
sibling nodes/subcells, Paramecium can also enhance security to some degree
through Practical Byzantine-like algorithms.

Abstract. This paper introduces a replication method for object-oriented data
storage that is flexible enough to fit different applications and improve their availability.
In view of the semantics of different applications, the method defines three
data-consistency criteria, and developers can select the most appropriate
criterion for their programs through storage APIs. One criterion realizes a
quasi-linearizability consistency, which causes non-linearizability with low
probability but may not impair the semantics of applications. Another is a
weaker one that can be used by many Internet services to provide more read
throughput, and the third implements a stronger consistency to fulfill strict linearizability.
All three follow one single algorithm framework and
differ from each other only in a few details. Compared with conventional
application-specific replication methods, this method has higher flexibility.



1 Introduction 

We implement an object-oriented data management layer as cluster infrastructure
software, specifically for the construction of Internet services. The impedance mismatch
problem [1] is avoided because its interface is compatible with the Java Data Objects
API [2], an emerging standard for transparent data access in Java.

However, existing replication methods [3][4][5] cannot be adopted by OStorage.
Therefore, we design a replication method that can be embedded in the storage APIs.
Users can employ it to set parameters on replication consistency and ease development.

In the replication method, the principles of high flexibility, low latency, and
adjustability are emphasized. The flexible replication method defines and implements
three consistency criteria: soft consistency, general consistency, and rough consistency.
General consistency realizes a quasi-linearizability consistency, which
causes non-linearizability with low probability but does not impair the semantics of
applications. Moreover, it does not introduce any total order multicast to solve the consensus
problem, so its latency is very low. Soft consistency is a weaker criterion that does not
support linearizability at all, in exchange for more read throughput. The third implements
a stronger consistency to fulfill strict linearizability. All three follow one single
algorithm framework and differ from each other only in some details.

The rest of this paper is organized as follows. Section 2 presents the three consistency
criteria and emphasizes solutions to key issues. Section 3 summarizes this paper.



2 Replication Algorithms 

To simplify the presentation of the method, the system model is described abstractly
as follows.

- Reliable point-to-point communication is supported by the low-level network protocol
with the FIFO property. Reliability means the network layer guarantees that a receiver
will get the message within latency p after sending; otherwise the receiver can be
assumed to have failed.

- A system-global loosely synchronized clock is implemented using the Network Time
Protocol, whose maximum time error, x, is less than 1 ms.

- Nodes and network links may crash, but we assume Byzantine failures and network
partitions will not occur.

- The atomic data cell is called an object, which is a data block of variable size.

- Three kinds of commands are permitted on an object: read, overwrite, and recover-read;
creating and deleting are two forms of the overwrite operation.
When a crashed storage node, named a Brick, recovers, the objects it holds are outdated,
and it sends a recover-read command to other Bricks to update its data.

In addition, some terminology is introduced.

- A command C is described as <type, O, D, timestamp>, where type is one of overwrite,
read, and recover-read, O is the object to operate on, D is null if type is not
overwrite, and timestamp denotes the time when the command is invoked.

- An object O is described as <OID, D, etype, etimestamp>, where OID is its identification,
etype is the type of the latest command on it, and etimestamp is the timestamp
of that command.

Several copies of an object O are distributed to a set of Bricks denoted
view(O). Each copy is called a replica. Before an object O is created, some Bricks
must be selected to form view(O).



2.1 General Consistency Algorithm 

General consistency is almost equal to linearizability and can be used in most
applications. A global loosely synchronized clock is used to order commands under the
general consistency criterion. There may be a tiny clock error between any two nodes,
which means causality can be violated with low probability. The principles of the
different operations are described as follows.

Read: A read command is sent to any Brick in view(O). If no successful response
returns, it is redirected to another Brick until the data is obtained.

Write: An overwrite command is sent to every Brick in view(O) and waits for a
response. Once the first successful response returns, the overwrite command is
considered accomplished. In addition, the Thomas write rule [6] is employed here. That
is, timestamps are attached to commands and objects, and only the latest update is
accepted. The timestamp of an object is then the one from its latest command.
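
The write path above can be illustrated with a minimal sketch of the Thomas write rule on a single replica. The class and field names (Command, Replica, last_type, last_timestamp) are hypothetical, and rejecting an update whose timestamp equals the stored one is an assumption; the text only states that the latest update is accepted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Command:
    type: str              # "overwrite", "read" or "recover-read"
    oid: str               # identifier of the target object O
    data: Optional[bytes]  # D; None unless type is "overwrite"
    timestamp: float       # invocation time from the synchronized clock


@dataclass
class Replica:
    oid: str
    data: Optional[bytes] = None
    last_type: Optional[str] = None          # type of the latest command
    last_timestamp: float = float("-inf")    # timestamp of the latest command

    def apply_overwrite(self, cmd: Command) -> bool:
        """Apply an overwrite only if it is newer than the stored state."""
        if cmd.timestamp <= self.last_timestamp:
            return False  # stale update: discarded (assumed tie-breaking)
        self.data = cmd.data
        self.last_type = cmd.type
        self.last_timestamp = cmd.timestamp
        return True
```

With this rule a replica converges to the newest write regardless of the order in which overwrites arrive, which is why the first successful response can be treated as completion.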



2.2 Soft Consistency Algorithm 

Linearizability is too strict for some applications that need faster response
(especially for read). The soft consistency algorithm is therefore designed. Compared
with linearizability, soft consistency is defined as follows.

CS is the set of all actual commands. The execution is said to be soft consistent if
there exists a sequence S on CS that satisfies these two conditions.

- Ca is before Cb in S if Ca«Cb; however, if Cb is a read command, this condition
need not hold.

- The second condition is the same as that of linearizability.

In effect, it allows read commands to obtain stale data. Compared with general
consistency, it differs only in the implementation of read: a read command is executed
as soon as it is received, without regard to any other conditions. This is valuable
for many Internet services. For example, a cluster-based email server maintains
two mailboxes for every user and saves every email to both boxes. On reception
of an email for user U, the server updates box u1 first. If the user visits box u2 at
that moment, before it has been refreshed, he/she will get stale data. This does not
impair the usage of mailboxes, because users can imagine that the email is still in
transmission, which is still consistent with email protocols.



2.3 Rough Consistency Algorithm 

Rough consistency offers strict linearizability, but it introduces more latency. This
algorithm employs a global token system [7] to replace the loosely synchronized clock.

Before a command is sent, a global token is attached to it; token numbers increase one
by one. Bricks execute commands in token order, and a Brick must not execute a command
whose token number is not exactly one larger than that of the previously executed
command. In that case, the Brick has to wait for the commands in between.

In addition, the overwrite command is considered accomplished only when successful
responses from every Brick in view(O) have returned. Total order is therefore
maintained strictly and the adverse circumstance in general consistency will never
happen.
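
The token rule can be sketched as a small buffer on each Brick that holds out-of-order commands until the gap is filled. This is only an illustration under stated assumptions: the class name TokenOrderedBrick is hypothetical, tokens are assumed to start at 1, and commands are plain strings.

```python
class TokenOrderedBrick:
    """Executes commands strictly in token order, buffering any that arrive early."""

    def __init__(self, first_token: int = 1):
        self.next_token = first_token  # token number the Brick may execute next
        self.pending = {}              # token -> command, held until its turn
        self.executed = []             # commands in the order they were applied

    def receive(self, token: int, command: str) -> None:
        self.pending[token] = command
        # Drain every buffered command whose token is now contiguous.
        while self.next_token in self.pending:
            self.executed.append(self.pending.pop(self.next_token))
            self.next_token += 1
```

A command with token 2 that arrives before token 1 stays in `pending`; both are executed, in order, as soon as token 1 arrives.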



2.4 How to Recover 



When a crashed Brick recovers, the data it holds is outdated and should be refreshed.
So it sends a recover-read command to other Bricks and buffers the commands it
receives until all data are updated. Because the general consistency algorithm does
not solve the consensus problem, after carrying out a recover-read command on object
O, a Brick may receive an overwrite command on O that was accomplished in another
Brick before the recover-read command. Linearizability therefore cannot be maintained.
The solution is that the command is put off for a period of time (equal to T) before
it executes.



3 Conclusion 

General consistency realizes a quasi-linearizable consistency, which can cause
non-linearizability with low probability, but developers can adopt rough consistency
or improve the applications to avoid this case. Soft consistency is a weaker criterion
that permits applications to obtain stale data and is fit for some Internet services.
Moreover, rough consistency realizes strict linearizability but introduces longer
latency.
Abstract. A network attached storage system is proposed to solve the bottleneck
problem of the multimedia server. It adds a network channel to the RAID so that
data can be transferred between the Net-RAID and clients directly. The architecture
avoids expensive store-and-forward data copying between the multimedia server and
storage devices when clients download/upload data from/to the server. The system
performance of the proposed architecture is evaluated through a prototype
implementation with multiple network disk arrays. In a multi-user environment, the
measured data transfer rate is 2~3 times higher than that with a traditional disk
array, and service time is about 3 times shorter. Experimental results show that the
architecture removes the server bottleneck and dynamically increases system bandwidth
with the expansion of storage system capacity.



1 Introduction 

Multimedia service is pervasive on the Internet now and continues to grow rapidly. 
Most multimedia service provider systems have adopted a typical system architecture 
in which the storage devices are attached privately to the server. When a client 
browses some multimedia data from the server, data should be fetched from the stor- 
age devices and then forwarded to the client by the server. Unfortunately, with the 
steady growth of Internet subscribers, the multimedia server quickly becomes a
system bottleneck. 

More recently, there have been some research efforts invested in solving the bot- 
tleneck problem of the multimedia servers. A distributed server architecture that 
places the streaming servers close to the user clusters has been proposed, where the 
system is able to achieve scalable storage and streaming capacities by introducing 
more repository servers and local servers as the traffic increases. A scalable multime- 
dia server based on a clustered architecture is discussed, where a group of nodes 
are connected by a switch (interconnection network). These related works have made
progress in enhancing the aggregate bandwidth of the multimedia system.

In this paper, a Network attached Redundant Array of Independent Disks (Net-RAID) is
proposed to solve the server bottleneck. There are two different channels in the disk
array. One is a SCSI (Small Computer System Interface) bus that makes the disk array
work as a normal storage system. The other is a network interface that transfers data
between clients and the disk array directly. A multimedia server with Net-RAIDs is
implemented and the experimental results show that the bandwidth of the server is
enlarged by the Net-RAIDs.



2 Architecture of Multimedia Server System with Net-RAIDs 

A multimedia server is designed with the network attached RAID, as shown in Fig. 1.
All Net-RAIDs are centrally controlled by the server through the SCSI channel for the
convenience of management, just like a normal storage system, while the network
interfaces of all Net-RAIDs allow parallel data transmission. By keeping the SCSI
channel of each Net-RAID connected to the multimedia file server to exert central
control, the design strikes a good balance between centralized file management and
distributed data storage.




Fig. 1. Multimedia Server with Net-RAIDs 



Storage system capacity must keep pace with the continuous growth of multimedia
data. The system in Fig. 1 achieves this capacity scalability by expanding the system
storage capacity incrementally with additional Net-RAIDs, along with associated
network interfaces that expand the data transmission rate proportionally.



3 The Redirection of Data Transfer 

All Net-RAIDs in the system are collected as a virtual storage pool. The virtual
storage protocol consists of a virtual layer, a logic map layer and a data redirection
layer. The virtual layer is used to simulate a standard block device driver and
register the virtual storage pool. The logic map layer provides a standard interface
with the block buffer cache and realizes the address map of the virtual storage pool.
Different logic map layers lead to different virtual storage pool functions. The data
redirection layer provides an interface for physical device drivers, and redirects the
data requests from the host to the physical devices.

For example, a File Transfer Protocol (FTP) session consists of two connections.
One is the control connection, through which a client connects with the server; it is
kept for the whole session. The other is the data connection, through which the server
transfers data with a client; it is established when data is to be downloaded from or
uploaded to the server. Because FTP uses different logical channels to transport
control and data packets, we move the logical channel of the data connection from the
server to the physical network channel of the Net-RAID.

When the client downloads a file, the server parses the data information (start
address and data length) of the requested file over the SCSI channel. Afterwards, the
server sends the data information and client information to the Net-RAID over the
network. A data connection is established between the Net-RAID and the client. The
Net-RAID gets the requested data from the SCSI disks according to the data
information, and transfers the data to the client directly according to the client
information.
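
The download path can be illustrated with a small in-memory sketch: the server only resolves metadata (start address and length), and the array delivers the payload to the client. All class and attribute names here (Extent, NetRaid, Client, Server, file_table, bytes_forwarded) are hypothetical, and real SCSI and network transport are replaced by direct method calls.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    start: int   # start address of the file on the array
    length: int  # data length in bytes


class Client:
    def __init__(self):
        self.received = b""


class NetRaid:
    def __init__(self, blocks: bytes):
        self.blocks = blocks

    def send_to_client(self, extent: Extent, client: Client) -> None:
        # Data path: array -> client, bypassing the server.
        end = extent.start + extent.length
        client.received += self.blocks[extent.start:end]


class Server:
    def __init__(self, raid: NetRaid, file_table: dict):
        self.raid = raid
        self.file_table = file_table  # file name -> Extent (metadata only)
        self.bytes_forwarded = 0      # stays 0: the server never copies payload

    def handle_download(self, name: str, client: Client) -> None:
        extent = self.file_table[name]            # control path: resolve metadata
        self.raid.send_to_client(extent, client)  # hand the data transfer to the array
```

The point of the sketch is that `bytes_forwarded` never changes: the store-and-forward copy through the server is what the redirection removes.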



4 Performance Measurement 

In order to compare the performance of the prototype with that of the traditional
system, where the disk array is only attached to the server, we configure a HUST-RAID
that has the same hardware platform as Net-RAID, except for the NIC. The HUST-RAID is
directly attached to the multimedia server through the SCSI channel. Peak read and
write performance of HUST-RAID is 46 MB/s and 33 MB/s, respectively. The multimedia
server is configured as an FTP server and it is connected to 100 Mbps Ethernet.

The performance of the system is measured by the aggregate bandwidth when a
number of clients download/upload files from/to the server simultaneously. Table 1
shows the performance comparison between the prototype and the traditional system.
The aggregate bandwidth of the prototype is larger than that of the traditional one and
it approaches the network bandwidth. In a multi-user environment, the data transfer
rate is 2~3 times higher than that with a traditional disk array.



Table 1 Performance comparison between the prototype and the traditional system

                        Traditional system (MB/s)           Prototype system with Net-RAID (MB/s)
Operation      Clients  Transfer rate  Average  Aggregate   Transfer rate  Average  Aggregate
Download file  1        6.75           6.75     6.75        7.36           7.36     7.36
               2        2.83           2.62     5           4.82           4.67     9.33
                        2.41                                4.51
               3        1.28           1.25     3.76        3.28           3.34     10.01
                        1.12                                4.01
                        1.36                                2.72
               4        0.88           0.82     3.29        2.62           2.48     9.93
                        0.93                                2.35
                        0.76                                3.11
                        0.72                                1.85
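
As a quick check of the figures in Table 1, the aggregate bandwidth can be recomputed as the sum of the per-client transfer rates and compared between the two systems. The assignment of per-client rates to each system is an assumption based on reading the table row by row; the variable names are illustrative.

```python
# Per-client download rates (MB/s) read from Table 1, keyed by number of clients.
traditional = {
    1: [6.75],
    2: [2.83, 2.41],
    3: [1.28, 1.12, 1.36],
    4: [0.88, 0.93, 0.76, 0.72],
}
prototype = {
    1: [7.36],
    2: [4.82, 4.51],
    3: [3.28, 4.01, 2.72],
    4: [2.62, 2.35, 3.11, 1.85],
}


def aggregate(rates):
    """Aggregate bandwidth is the sum of the simultaneous per-client rates."""
    return sum(rates)


# Ratio of prototype aggregate bandwidth to traditional aggregate bandwidth.
ratios = {n: aggregate(prototype[n]) / aggregate(traditional[n]) for n in traditional}

for n, ratio in sorted(ratios.items()):
    print(f"{n} client(s): prototype/traditional = {ratio:.2f}")
```

Under this reading, the ratio grows with the number of clients, and the 2~3 times advantage appears once three or more clients are active.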



When we add another Net-RAID to the prototype system, the aggregate bandwidth
is nearly 20 MB/s (see Table 2). This shows that the performance of the system
increases almost linearly with the number of Net-RAIDs, and that the system bottleneck
has moved from the server to the network.
Table 2 Performance comparison when two Net-RAIDs are in the system

                        Traditional architecture (MB/s)   Prototype system with Net-RAID (MB/s)
Operation      Clients  Transfer rate   Aggregate         Transfer rate   Aggregate
Download file  3        3.23            9.94              6.67            19.03
                        3.32                              6.85
                        3.57                              5.51
Upload file    3        4.50            10.10             6.37            18.40
                        2.96                              5.35
                        2.64                              6.68


5 Conclusions 

An innovative network attached disk array architecture, called Net-RAID, is proposed
and implemented. It adds a network channel to the RAID so that data can be transferred
between the Net-RAID and clients directly. The architecture removes the server
bottleneck and dynamically increases system bandwidth with the expansion of storage
system capacity. Experimental results provide useful insights into the performance
behavior of the system. The architecture can also be adopted to transfer massive data
in other servers, such as database servers, HTTP servers and so on.
Abstract: Network based data management/backup/restore is the key component in the
data storage centre. This paper proposes a new network based data management scheme,
NDMP-Plus. We first discuss the components of the NDMP-Plus architecture. Then, we
detail two new techniques in NDMP-Plus: VSL (Virtual Storage Layer) and the
negotiation mechanism. VSL is the core component that provides flexibility, and it
avoids direct network communication with the storage media. The negotiation mechanism
is the key mechanism for improving performance. Furthermore, we carry out an
experiment to evaluate the performance of NDMP-Plus. The results suggest that
NDMP-Plus has stronger flexibility and higher performance than the original NDMP.



1. Introduction 

In a modern data storage centre, it is too difficult for the administrator to manage,
back up and restore thousands of millions of data items using distributed file systems
(e.g. NFS, CIFS and DAFS [1]). To implement data backup/restore management,
NDMP (Network Data Management Protocol) was introduced [2]. However, because the NDMP
framework lacks flexibility, we introduce new techniques for NDMP in this paper to
enhance its flexibility and performance. In particular, we propose the NDMP-Plus
prototype, which originates from NDMP but differs in the interior design of the
architecture. Compared with NDMP, the NDMP-Plus prototype has two new features:

(1) NDMP-Plus introduces VSL (Virtual Storage Layer), by which it can erase
the difference between the data service and the tape service of NDMP and provide a
new uniform service, the data+ service, used as either the data provider or the backup
server dynamically. VSL also avoids direct network communication with the storage
media.

(2) NDMP-Plus provides a negotiation mechanism, by which it can decide the
data transmission format dynamically and enhance performance (e.g. the data can be
transmitted to the data backup server in the form of the tape device or the
file-system). NDMP, in contrast, uses only the tape device format to transmit data,
which introduces unnecessary processing in some cases (e.g. backing up data from one
NAS to another) and reduces performance.
2. The Architecture of NDMP-Plus




Fig. 1. The basic architecture of NDMP-Plus 



The NDMP-Plus basic architecture, as shown in Fig. 1, provides DMA (Data Man- 
agement Application) and the data+ service. The administrator uses DMA to manage, 
backup and restore data. The data+ service is a uniform service for all the NDMP- 
Plus compliant hosts. Compared with the data service and the tape service of NDMP, 
the data+ service of NDMP-Plus provides the uniform interface to all storage devices. 
In the glossary of NDMP-Plus, there is no primary or secondary storage device; all 
storage devices are the same to the data+ service and classified by the working 
method of the storage media. 




Fig. 2. The Components of The Data+ Service 



A data+ service is composed of the VSL module, the network module and the
storage modules, as shown in Fig. 2. A fundamental issue in the data+ service is how
the backup/restore manipulation and the storage media are associated. To address it,
this paper provides a media-independent module, the VSL module. VSL instantiates each
storage medium as a VSL device entity and provides the information about that storage
medium to the network module. The storage module comprises one or more sub-storage
media modules. Each kind of sub-storage media module is classified by the read/write
method and the storage format of data.



3. The Implementation of VSL

VSL is a key component to realize the flexibility of the NDMP-Plus prototype, man- 
age the storage modules and provide a series of uniform VSL interfaces. The internal 
frame of VSL is shown in Fig. 3. 
VSL employs the pair {MediaID, DeviceID} to identify every logical storage
partition the administrator can access. Both MediaID and DeviceID are globally
unique numbers in the scope of VSL; hence DMA can use DeviceID to access the
right logical storage partition and VSL can use MediaID to redirect the access to the
real storage media. In practice, VSL provides a structure (VSLStorage) to describe an
abstract storage module and a structure (VSLPartition) to describe a logical storage
partition. The two structures cooperate with each other to implement the
management of VSL.
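
The two-level lookup described above can be sketched as follows. Only the names VSLStorage, VSLPartition, MediaID and DeviceID come from the text; the fields, the `register` and `resolve` methods and the media kinds are assumptions made for illustration, not the prototype's actual structures.

```python
from dataclasses import dataclass, field


@dataclass
class VSLStorage:
    media_id: int
    kind: str  # e.g. "tape" or "file-system"; labels are illustrative


@dataclass
class VSLPartition:
    device_id: int
    media_id: int  # which VSLStorage backs this logical partition


@dataclass
class VSL:
    storages: dict = field(default_factory=dict)    # media_id -> VSLStorage
    partitions: dict = field(default_factory=dict)  # device_id -> VSLPartition

    def register(self, storage: VSLStorage, device_id: int) -> VSLPartition:
        """Expose a storage module to the DMA under a logical DeviceID."""
        self.storages[storage.media_id] = storage
        partition = VSLPartition(device_id, storage.media_id)
        self.partitions[device_id] = partition
        return partition

    def resolve(self, device_id: int) -> VSLStorage:
        """Redirect an access on a DeviceID to the backing storage media."""
        return self.storages[self.partitions[device_id].media_id]
```

The DMA side only ever handles a DeviceID, so the backing media can change without affecting callers.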




[Figure labels: VSL READ, VSL WRITE, VSL Control]



Fig. 3. The Internal Frame of VSL 



4. The Negotiation Mechanism

In the glossary of NDMP-Plus, one session means that one connection between the
data provider server and the data backup server can be used for only one type of
procedure (either backup or restore), possibly more than once. In contrast to NDMP,
at the outset of each session NDMP-Plus must run the negotiation mechanism for this
session, as shown in Fig. 4.



[Figure labels: Procedure Source, Procedure Destination]




Fig. 4. The Negotiation Procedure 



We take the backup procedure as an example to illustrate the negotiation mechanism.
After a connection is established, the procedure source first asks the procedure
destination for its transmission forms, and the procedure destination provides the
form list to the source. Second, the source selects the appropriate form as the
transmission form of this session and notifies the destination. Finally, the backup or
restore procedure can start. The restore procedure is almost the same as the backup
procedure, except that the backup server is the procedure source.

Based on the above discussion, the backup procedure is the same as the restore
procedure from the network communication viewpoint. Through the negotiation
mechanism, the procedure destination provides the transmission forms to negotiate.
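
The three negotiation steps can be sketched as a single function. The function name `negotiate`, the form labels and the idea that the source ranks forms by preference are assumptions for illustration; the text only says that the source selects an appropriate form from the destination's list.

```python
def negotiate(destination_forms, source_preference):
    """Return the first form the source prefers that the destination offers."""
    offered = set(destination_forms)  # step 1: destination sends its form list
    for form in source_preference:    # step 2: source selects an appropriate form
        if form in offered:
            return form               # step 3: source notifies the destination
    raise ValueError("no common transmission form")
```

For a NAS-to-NAS backup, a source that prefers the file-system form would pick it whenever the destination offers it, and fall back to the tape form otherwise.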



5. Evaluation and Conclusion 

The NDMP-Plus prototype is implemented on the FreeBSD 4.7 operating system. In
our experiments, we use two NAS boxes as the data provider host and the data
backup host. We use three testing methods to demonstrate the performance
enhancement of the NDMP-Plus prototype: (1) Using a traditional distributed file
system, CIFS, the administrator at the Client backs up the data from NAS1 to NAS2;
(2) Using NDMP, the administrator uses the NDMP DMA to back up data from NAS1
to NAS2; (3) Using NDMP-Plus, the administrator uses the NDMP-Plus DMA to
back up data from NAS1 to NAS2.

Our testing data come from the file-system "/usr". The results are shown in
Table 1.



Table 1. The "/usr" Performance of Client Benchmarks

Method      Time consumed (s)   Transmission speed (KB/s)   Network traffic (KB/s)   CPU utilization (%)
CIFS        1813                399.36                      794.7                    30-50
NDMP        832                 870.4                       About zero               About 2
NDMP-Plus   759                 952.32                      About zero               About 2


Based on the data of Table 1, we can reach the following conclusions:

(1) Compared with CIFS, the average speed of the NDMP-Plus backup method improves
by 90%. The reason is that the data of NDMP-Plus are transmitted directly
between NAS1 and NAS2, whereas the data of CIFS are transmitted from NAS1 to the
client, and then to NAS2.

(2) Compared with NDMP, NDMP-Plus improves the transmission speed by
about 9%. The reason is that NDMP-Plus provides a negotiation mechanism to
select the data form; hence it can use an appropriate form to transmit data more
efficiently, avoiding unnecessary steps.
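
The comparison with NDMP in point (2) can be reproduced from the Table 1 figures with a few lines of arithmetic. This is a worked check only; the dictionary names are illustrative.

```python
# Figures for the two protocols, taken from Table 1.
speed_kb_s = {"NDMP": 870.4, "NDMP-Plus": 952.32}
time_s = {"NDMP": 832, "NDMP-Plus": 759}

# Relative gain in transmission speed of NDMP-Plus over NDMP.
gain = speed_kb_s["NDMP-Plus"] / speed_kb_s["NDMP"] - 1

# Relative reduction in elapsed time of NDMP-Plus over NDMP.
time_saved = 1 - time_s["NDMP-Plus"] / time_s["NDMP"]

print(f"speed gain over NDMP: {gain:.1%}")
print(f"time saved over NDMP: {time_saved:.1%}")
```

Both measures land close to the 9% improvement stated in point (2).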

We present a newly designed network based data management prototype
(NDMP-Plus) in this paper. It has a more flexible architecture and higher performance
than NDMP. Future work may focus on further enhancing the performance
of NDMP-Plus and implementing the snapshot technology [3].
Abstract. Watershed segmentation/transform is a classical method for image
segmentation in gray scale mathematical morphology. However, the watershed
algorithm is strongly recursive in nature, so a straightforward parallelization has
very low efficiency. First, the advantages and disadvantages of some existing
parallel algorithms are analyzed. Then, a Further Optimized Parallel Watershed
Algorithm (FOPWA) is presented based on the boundary components graph. As the
experiments show, FOPWA optimizes both running time and relative speedup,
and has more flexibility.



1 Introduction 

Watershed segmentation/transform is a classical and effective method for image
segmentation in gray scale mathematical morphology. This widely applicable method has
been applied successfully in fields such as remote sensing image processing for
satellite and radar, biomedical applications and computer vision. However, the
watershed transform is a relatively time-consuming task because of its low efficiency,
and in the above fields, such as remote sensing applications, large images,
e.g. 1024 x 1024, 3000 x 3000 or larger, are not uncommon and usually must be
processed in real time. Therefore, studying watershed algorithms that are easy to
parallelize is meaningful for real applications.



2 Related Work 

Meijster and Roerdink proposed a three-stage parallel watershed algorithm (M-R
algorithm for short) in [1] based on the components graph, which was designed for a
ring architecture with shared memory. But there are some potential logic errors/limits
in the M-R algorithm, as shown in [2]. Therefore, [2] proposed an improved parallel
watershed algorithm (IPWA for short) for distributed memory systems, which achieved
better performance. On further study, we find that the adaptability of these two
algorithms is limited: 1) The parallel efficiency of the two algorithms is very low on
images that contain large objects. 2) They are only designed for the
segmentation of images containing many plateaus with large area. 3) Because of
the simplified computation of plateaus, the algorithms may produce images that
contain thick watersheds, which need post-processing.

Moga, Cramariuc et al. gave some parallel methods of watershed transform
based on the definition by topographical distance [3]. According to [4], the
proposed method based on Ordered Queue (OQ for short) is derived from the optimal
sequential watershed algorithm, but its scalability is quite limited. An alternative
solution, namely image integration by sequential scanning, introduced by [5],
provides an equitable work load on multiprocessors, and hence a better relative
speedup, but the absolute running time of this algorithm is very long. [6] then
proposed a method named rain-falling or hill-climbing simulation, which
reduces re-scanning overhead by computing a lower-complete image, but introduces
undesirable overhead caused by the lower distance computation and preserves the
data dependent character of the algorithm. In addition, these algorithms do not
construct watershed lines, but only labeled regions [3], so post-processing is needed.



3 An Optimized Parallel Method: FOPWA 

Considering the strengths and weaknesses of the above algorithms, we propose a
Further Optimized Parallel Watershed Algorithm (FOPWA) based on the definition of
watershed transform by topographical distance. The topographical distance based
watershed transform starts by detecting the minima of the input data (called
seed-pixels); ordered region growing is then performed according to lower distance.
Lower distance is formally defined as follows:

Definition 1. Let $D \subset \mathbb{Z}^2$, $f$ be a digital gray value image in
domain $D$, and $G$ be the underlying grid of $f$; $f(p)$ denotes the gray value of
pixel $p$. Lower distance $d$ is defined as: $d(p) = 0$ if $p$ is a minimum;
otherwise, $d(p)$ equals the length $l$ of the shortest path
$\{p = p_0, p_1, \ldots, p_l = q\}$ from $p$ to a pixel $q$ such that
$\forall i \in \{1, 2, \ldots, l\}$, $(p_{i-1}, p_i) \in G$, $f(q) < f(p)$, and, if
$l > 1$, $\forall i \in \{1, 2, \ldots, l-1\}$, $f(p_i) = f(p_{i-1})$.
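
Definition 1 can be computed with a multi-source breadth-first search over plateaus. The sketch below is one straightforward sequential way to do it, not the paper's implementation; the function name `lower_distance`, the list-of-lists image and the 4-connected grid are assumptions.

```python
from collections import deque


def lower_distance(f):
    """Lower distance of every pixel of a 2D gray image f (4-connected grid)."""
    rows, cols = len(f), len(f[0])
    d = [[None] * cols for _ in range(rows)]

    def neighbors(r, c):
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                yield nr, nc

    # Pixels with a strictly lower neighbor are at distance 1.
    queue = deque()
    for r in range(rows):
        for c in range(cols):
            if any(f[nr][nc] < f[r][c] for nr, nc in neighbors(r, c)):
                d[r][c] = 1
                queue.append((r, c))

    # Propagate inward across plateaus (equal gray value), one step at a time.
    while queue:
        r, c = queue.popleft()
        for nr, nc in neighbors(r, c):
            if d[nr][nc] is None and f[nr][nc] == f[r][c]:
                d[nr][nc] = d[r][c] + 1
                queue.append((nr, nc))

    # Pixels never reached belong to minima, where the distance is 0.
    return [[0 if v is None else v for v in row] for row in d]
```

On a one-row image with gray values 2, 2, 2, 1, the last pixel is a minimum and the plateau pixels get distances that grow with their distance from the descent.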

Each non-seed pixel is put into different catchments (regions) in an increasing order 
of gray levels. This recursive label propagation is called flooding. We define a 2D 
ordering relation to satisfy parallel requirement for flooding. 

Definition 2. The 2D ordering relation can be formulated by two conditions.
Condition 1. If $f(q) = \min_{r \in N_G(p)} \{ f(r) \mid f(r) < f(p) \}$, then $q$ is
the preceding-pixel of $p$ (also called $p$ is flooded by $q$), and $L(p) = L(q)$.
$L(\cdot)$ denotes the output label image. Condition 2. If $f(p) = f(q)$ and
$d(p) = d(q) + 1$, then $L(p) = L(q)$. Here $d(p)$ stands for the lower distance of
$p$ with initial value $\infty$ ($d(p) = \infty$), which denotes a maximum value, and
$N_G(p)$ for the set of neighboring pixels of $p$ with respect to the surrounding
pattern.
We also assign a unique label to each boundary pixel that has a preceding-pixel;
such pixels are treated as seed-pixels, like minima. Consequently, based on this 2D
ordering relation, each processor can correctly and exhaustively delimit the extent of
regions crossing the local sub-domains, regardless of what is happening in other
processors; hence parallel computing can be realized. We name these additional
seed-pixels pseudo seed-pixels. Then, we define a Boundary Components Graph (BCG) to
record the ordering relation between a pseudo seed-pixel and its preceding-pixel for
flooding and merging in later steps. Note that BCG is only related to the boundary
pixels of each sub-domain, not all pixels in the input image, so the size of the
components graph is reduced.
Definition 3. Consider the input image $f$ as a directed valued graph
$G = (V, E, f)$, in which $V$ is the set of pixels in the graph, and $E$ is the set of
edges of the graph defining the connectivity. BCG $G^* = (V^*, E^*, f^*)$
($f^* = f$) can be defined as follows:
1) If $v \in V \wedge L(v) = H$ or
$L(v) \neq H \wedge (\exists p,\ p \in V \wedge p \in N_G(v) \wedge L(p) = H)$, then
$v \in V^*$. 2) $\forall u, v \in V^*$, if $L(u) \neq H \wedge L(v) = H$, and $u$ and
$v$ satisfy Condition 1 or Condition 2 described above, then $(u, v) \in E^*$.
3) $\forall u, v \in V^*$, if
$L(u) = H \wedge L(v) = H \wedge f^*(u) = f^*(v) \wedge d(u) = d(v) = \infty$, then
$(u, v) \in E^*$.

For the implementation, the global domain D of size X x Y is split among N processors
into sub-domains $D_i$. FOPWA has four stages: 1) Detecting real and pseudo
seed-pixels and building local BCGs. 2) Local flooding. We use Ordered Queue (OQ) to
realize the local watershed transform, because OQ is derived from the optimal serial
watershed algorithm and can obtain relatively shorter running time for local tasks.
3) Global merging. We use a process similar to that of IPWA [2] to merge components
graphs. 4) Broadcasting the merging result to each processor to update the local
results.



4 Experiments and Conclusion 





Fig. 1. The comparison of FOPWA and IPWA in speedups. The left is the speedup curves of 
two algorithms for the test image Lena with size of 512x512. The right is for the test image 
Airport with size of 1024x1024 

First, we implemented FOPWA and IPWA [2] on two parallel platforms, and tested them
on various images of different sizes (256/512/1024/2048). Moreover, the performance of
FOPWA is also compared with existing algorithms in running time and speedup. One
of the two parallel platforms is a Cluster system with 16 nodes. The other parallel
platform is the YinHe supercomputer (YH for short), which includes 32 processors.

Fig. 1 compares FOPWA with IPWA in speedup on two different platforms (dashed
curves represent the speedup trend for the Cluster system when the number of
processors is over 16). From this figure, we can draw some conclusions: 1) FOPWA
outperforms IPWA, and has better scalability. 2) FOPWA is less data dependent. 3) The
results from YH are better than those from the Cluster because YH has a better network.

FOPWA outperforms other existing parallel algorithms by combining the advantages 
of components-graph based method and distance based method but not introducing 
additional overhead. Table 1 compares performance of FOPWA with some existing 
parallel algorithms. As the experiments show, FOPWA optimizes both running time 
and relative speedup, and has more flexibility and adaptability, compared against old 
methods. 



Table 1. The comparison of FOPWA and other parallel algorithms in serial time and speedups
on YH. Test image is Airport with size of 1024x1024, N is the number of nodes in system.

Algorithms             Serial time (s)   Speedup (N=16)   Speedup (N=32)
Rigid OQ-based         11.402            5.548            7.358
Sequential scanning    52.882            14.964           26.296
Rain-falling           5.958             2.877            4.532
Connected component    11.883            11.777           21.527
IPWA                   1.445             9.658            11.709
FOPWA                  1.312             13.299           24.876
Abstract. Using a proxy cache is a key technique that may help to reduce
server load, network bandwidth and startup delays. Based on the popularity of
clients' requests for segmented video, we extend the length for batch and
patch by using the dynamic cache of a proxy cache for streaming media. We
present transmission schemes that use the dynamic cache, such as unicast
suffix batch, unicast patch, multicast patch, multicast merge and optimal
batch patch, with a proxy cache based on segmented video. We then
quantitatively explore the impact of the choice of transmission scheme,
cache allocation policy, proxy cache size, and availability of unicast
versus multicast capability on the resultant transmission cost.



1 Introduction 

We consider a video stream that travels from a server through the Internet to the
end clients. We assume that clients always request playback from the beginning of a
video. The proxy receives the client request and, if the video prefix is available in
the proxy, streams the prefix directly to the client. If the video is not present in
the proxy, the proxy contacts the server and streams the data received from the
server to the client. According to the popularity of video requests from the clients,
we store the segmented videos in the proxy cache and thereby enhance the byte hit
ratio of the proxy cache.

In order to raise the efficiency of the proxy cache for streaming videos, we
present a segmented video cache scheme based on the popularity of video requests
from the clients. We segment the video into small parts that can be cached and
replaced. Suppose the minimal cache allocation unit length is b; we segment the
video into multiples of b. The value of b is determined according to the startup
delays in the Internet environment. In the later discussion, we simply assume that
the video length unit is b, so that each video unit can be independently cached and
replaced. The length of the prefix cache unit is also b. According to the
popularity of video requests, we cache video segments of different sizes and ensure
that the segmented cache obtains the maximal byte hit ratio. We extend the length
for batch and patch by using the dynamic cache of a proxy cache for streaming media.
We present transmission schemes that use the dynamic cache, such as unicast suffix
batch, unicast patch, multicast patch, multicast merge and optimal batch patch, with
a proxy cache based on segmented video.

2 Proxy Cache Based on Segmented Video

The quantity $f_i$ measures the relative popularity of a video: every access to
the video repository has probability $f_i$ of requesting video $i$. Let
$\lambda_i$ be the access rate of video $i$ and $\lambda$ be the aggregate
access rate to the video repository. The storage vector
$V = (V_1, V_2, \ldots, V_N)$ specifies that a prefix of length $V_i$ seconds of
each video $i$ is cached at the proxy, $i = 1, 2, \ldots, N$. The storage vector
$U = (U_1, U_2, \ldots, U_N)$ specifies that a segmented cache of length $U_i$
seconds of each video $i$ is cached at the proxy, $i = 1, 2, \ldots, N$. The
length of video $i$ cached at the proxy is $V_i + U_i$. Let $c_s$ and $c_p$
represent the costs of transmitting one byte of video data on the server-proxy
path and the proxy-client path, respectively. $C_i(V_i, U_i)$ is the
transmission cost per unit time for video $i$ when the length of the proxy
cache is $V_i + U_i$, and $r_i$ represents the stream rate of video $i$.



2.1 Unicast Suffix Batch Based on Segmented Video

Unicast suffix batch using dynamic cache based on segmented video is a simple
batch scheme that makes use of the proxy cache to provide immediate playback.
It is designed for unicast transmission of video from the proxy to the clients.
Suppose the first request for video $i$ arrives at time 0. The proxy
immediately transmits the video prefix to the client. Unicast batch processing
schedules the transmission from server to proxy as late as possible while
ensuring that playback on the client side is continuous. That is to say, the
first frame of the suffix is scheduled to reach the client at time $V_i$. The
prefix cache length $V_i$ depends on the network environment, whereas the
length $U_i$ of the segmented cache depends on the rate of client requests for
this video; that is, we determine $U_i$ according to the popularity of the
video. For any request arriving in the interval $(0, V_i + U_i)$, the proxy
just forwards the incoming suffix (of length $L_i - V_i - U_i$) to the client,
so no new suffix transmission is needed from the server. In effect, several
suffix requests are served as one batch using dynamic cache. Segment-based
unicast transmission and prefix cache solve the problem of startup delay, and
batch transmission using dynamic cache increases transmission efficiency
because the batch interval is extended by the dynamic cache window. Suppose
requests arrive according to a Poisson process, so that the average number of
requests in $[0, V_i + U_i]$ is $1 + (V_i + U_i)\lambda_i$. These requests
share one transmission of the suffix $[V_i + U_i, L_i]$ from the server, and
the average transmission cost of video $i$ is:



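$$C_i(V_i, U_i) = \left( c_s \frac{L_i - V_i - U_i}{1 + (V_i + U_i)\lambda_i} + c_p L_i \right) \lambda_i r_i \qquad (1)$$

where $r_i$ is the stream rate of video $i$.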
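
As a quick numerical illustration of Eq. (1), the following Python sketch evaluates the unicast suffix batch cost. It assumes the cost has the form (c_s (L - V - U) / (1 + (V + U) λ) + c_p L) λ r; the function name and parameter names are illustrative and do not come from the paper.

```python
def suffix_batch_cost(L, V, U, lam, r, c_s, c_p):
    """Average transmission cost per unit time for unicast suffix batch.

    L: video length (s); V: cached prefix length (s); U: segmented cache
    length (s); lam: request rate (1/s); r: stream rate (bytes/s);
    c_s, c_p: per-byte costs on the server-proxy and proxy-client paths.
    """
    if not 0 <= V + U <= L:
        raise ValueError("cached length V + U must lie in [0, L]")
    batch_size = 1 + (V + U) * lam           # mean requests sharing one suffix
    server_part = c_s * (L - V - U) / batch_size
    client_part = c_p * L                    # each client receives all L seconds
    return (server_part + client_part) * lam * r
```

For example, comparing `suffix_batch_cost(100, 10, 0, 0.1, 1, 1, 0.5)` with `suffix_batch_cost(100, 10, 10, 0.1, 1, 1, 0.5)` shows how enlarging the segmented cache U lowers the server-side share of the cost under these assumptions.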



2.2 Unicast Patch Based on Segmented Video 

Unicast patch for a proxy cache based on segmented video using dynamic cache
can save network resources. The first request for video $i$ arrives at time 0,
and the part of the video not stored in the proxy cache arrives from the server
at time $V_i + U_i$ (Fig. 1). Suppose another client's request arrives at time
$t_2$, $V_i + U_i < t_2 < L_i$. One method is to read the part of the video not
cached at the proxy directly from the server. The other is to use patching.
Since segment $[t_2, L_i]$ has already been scheduled for transmission, only
the patch $[V_i + U_i, t_2]$ needs to be fetched from the server; the patch is
scheduled at time $t_2 + V_i + U_i$, and the client must receive from two
channels at the same time, one delivering the suffix and one delivering the
patch. Whether a patch is used depends on a suffix threshold $G_i$, measured
from the beginning of the most recent suffix transmission: if a request
arrives within $G_i$, the proxy schedules a patch from the server for it;
otherwise, it starts a new complete transmission of the suffix. Assuming a
Poisson arrival process, the average number of requests between the
initiations of two consecutive suffix transmissions is
$1 + \lambda_i (V_i + U_i + G_i)$. These requests share a single transmission
of the suffix $[V_i + U_i, L_i]$ from the server, and the total length of the
patches from the server for these requests is:



$$\lambda_i G_i \int_0^{G_i} \frac{x}{G_i}\, dx = \frac{\lambda_i G_i^2}{2} \qquad (2)$$

This is because, in a Poisson arrival process, the arrivals in the time
interval $[V_i + U_i, V_i + U_i + G_i]$ are uniformly distributed. The average
transmission cost of video $i$ is:



$$C_i(V_i, U_i) = c_s \lambda_i r_i \frac{\lambda_i G_i^2 / 2 + L_i - V_i - U_i}{1 + \lambda_i (V_i + U_i + G_i)} + c_p \lambda_i r_i L_i \qquad (3)$$
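
The role of the threshold in Eq. (3) can be explored numerically. The sketch below assumes the cost has the form c_s λ r (λ G²/2 + L - V - U) / (1 + λ (V + U + G)) + c_p λ r L and picks the threshold G by a simple grid search over [0, L - V - U]. The search range, the grid search itself, and all names are assumptions made for illustration, not the authors' procedure.

```python
def unicast_patch_cost(L, V, U, G, lam, r, c_s, c_p):
    """Average cost per unit time for unicast patch with threshold G."""
    cached = V + U
    if not 0 <= cached <= L or G < 0:
        raise ValueError("need 0 <= V + U <= L and G >= 0")
    patch_len = lam * G ** 2 / 2             # mean total patch length, Eq. (2)
    suffix_len = L - cached                  # one shared suffix transmission
    requests = 1 + lam * (cached + G)        # mean requests per suffix
    return c_s * lam * r * (patch_len + suffix_len) / requests + c_p * lam * r * L


def best_threshold(L, V, U, lam, r, c_s, c_p, steps=1000):
    """Grid-search the threshold G in [0, L - V - U] that minimises the cost."""
    span = L - V - U
    candidates = [span * k / steps for k in range(steps + 1)]
    return min(candidates,
               key=lambda G: unicast_patch_cost(L, V, U, G, lam, r, c_s, c_p))
```

With G = 0 the expression reduces to the suffix batch form, so the search can only match or improve on plain batching under these assumptions.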
Fig. 1. Unicast patch



Fig. 2. Case 1 



2.3 Multicast Patch Based on Segmented Video 

If the path from the proxy to the client supports multicast, the proxy can use
segmented multicast patch with dynamic cache. Suppose the first request for
video $i$ arrives at time 0 (Fig. 2). The proxy begins to transmit the cached
part of the video, of length $V_i + U_i$, by multicast at time 0; the server
begins to transmit the suffix of the video at time $V_i + U_i$; and the proxy
forwards the received data to the clients by multicast patch using dynamic
cache. Let $T_i$ be the threshold that controls how often the whole stream is
retransmitted. Suppose a later request arrives at time $t_2$
($0 < t_2 < T_i$), shortly after the stream has started. The transmission to
the clients falls into one of two cases, according to the relationship
between $V_i + U_i$ and $T_i$.

Case 1: $T_i < V_i + U_i < L_i$ (Fig. 2). The client receives segment
$[0, t_2]$ by unicast from the proxy through a separate channel, and receives
segment $[t_2, L_i]$ from the ongoing multicast. Assuming Poisson arrivals, the
transmission cost function $g_1(V_i, U_i, T_i)$ in this case is:

$$g_1(V_i, U_i, T_i) = \frac{\lambda_i r_i}{1 + \lambda_i T_i} \left[ (L_i - V_i - U_i) c_s + L_i c_p + \frac{\lambda_i T_i^2}{2} c_p \right] \qquad (4)$$



Case 2: $0 < V_i + U_i < T_i$ (Fig. 3). If $0 < t_2 < V_i + U_i$, the
transmission is the same as in Case 1. If $V_i + U_i < t_2 < T_i$, the client
receives segment $[0, V_i + U_i]$ by unicast from the proxy through a separate
channel, and receives $[t_2, L_i]$ from the ongoing multicast. The server
transmits $[V_i + U_i, t_2]$ through the proxy to the client by unicast.
Assuming Poisson arrivals, the transmission cost function
$g_2(V_i, U_i, T_i)$ is:

$$g_2(V_i, U_i, T_i) = \frac{\lambda_i r_i}{1 + \lambda_i T_i} \left[ (L_i - V_i - U_i) c_s + L_i c_p + \frac{\lambda_i (V_i + U_i)^2}{2} c_p + \frac{\lambda_i (T_i - V_i - U_i)^2}{2} (c_s + c_p) \right] \qquad (5)$$



For case $k$ ($k = 1, 2$), let $h_k(V_i, U_i)$ denote the minimum of the
transmission cost function over the threshold:
$h_k(V_i, U_i) = \min\{ g_k(V_i, U_i, T_i) : 0 < T_i < L_i \}$. For a given
proxy cache length $V_i + U_i$, the average transmission cost is
$C_i(V_i, U_i) = \min\{ h_1(V_i, U_i), h_2(V_i, U_i) \}$.
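
The minimisation over the threshold can be sketched numerically as follows. The two cost functions are written in the form in which Eqs. (4) and (5) are read here, a common factor λ r / (1 + λ T) times a bracketed per-stream cost, and this reading is an assumption. The sketch also restricts each function to the range of T where its case applies, which is an additional assumption, and all names are illustrative.

```python
def g1(T, L, cached, lam, r, c_s, c_p):
    """Case 1 cost: threshold T no larger than the cached length V + U."""
    body = (L - cached) * c_s + L * c_p + lam * T ** 2 / 2 * c_p
    return lam * r / (1 + lam * T) * body


def g2(T, L, cached, lam, r, c_s, c_p):
    """Case 2 cost: threshold T larger than the cached length V + U."""
    body = ((L - cached) * c_s + L * c_p
            + lam * cached ** 2 / 2 * c_p
            + lam * (T - cached) ** 2 / 2 * (c_s + c_p))
    return lam * r / (1 + lam * T) * body


def multicast_patch_cost(L, cached, lam, r, c_s, c_p, steps=2000):
    """Return (cost, T) minimising over a grid of thresholds in (0, L)."""
    best = None
    for k in range(1, steps):
        T = L * k / steps
        g = g1 if T <= cached else g2        # pick the case that applies to T
        cost = g(T, L, cached, lam, r, c_s, c_p)
        if best is None or cost < best[0]:
            best = (cost, T)
    return best
```

The two expressions coincide at T = V + U, so the combined cost is continuous in the threshold and a coarse grid is enough for a first look.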



Fig. 3. Case 2



Fig. 4. Optimal multicast patch

2.4 Multicast Stream Merging Based on Segmented Video

The key issue in stream merging is deciding how to merge a later stream into
an earlier stream. The Closest Target policy [5] is an online heuristic whose
performance is close to that of optimal offline stream merging. Our multicast
merge scheme integrates dynamic cache with stream merging and uses the Closest
Target policy to decide how to merge a later stream into an earlier one. For a
video segment required by the client, if a prefix of the segment is at the
proxy, it is transmitted directly from the proxy to the client. The suffix not
stored in the proxy cache is transmitted from the server as late as possible
while still ensuring continuous playback at the client. Let $P_j$ be the
probability that a $j$-second segment is required, $0 < j < L_i$. This can be
obtained by monitoring a running system. The average transmission cost of
video $i$ is:



$$C_i(V_i, U_i) = \left( \sum_{j=b}^{V_i + U_i} j P_j c_p + \sum_{j=V_i + U_i + b}^{L_i} \bigl( j (c_p + c_s) - (V_i + U_i) c_s \bigr) P_j \right) r_i \qquad (6)$$



2.5 Batch Patch Based on Segmented Dynamic Cache 

Recently, White and Crowcroft [8] introduced the concept of optimal batch
patch for prefix cache in order to minimize the average stream rate of the
server. We present optimal batch patch using dynamic cache based on segmented
video. Client requests are batched over a fixed interval before a regular
channel (RM) is requested from the server. This interval has length
$V_i + U_i$, and there is an optimal patch window $W$: once this window is
exceeded, initiating a new regular channel RM is more efficient than sending a
patch (Fig. 4). A batching interval in which an RM is started is called an RM
startup interval, and any other batching interval is a non-startup interval.
The average rate $R$ is taken over the period between two adjacent RM
startups, chosen at random:

$$R = \frac{(1 - P) r W^2 + (1 - P)(V_i + U_i) r W + (V_i + U_i) r L}{2 (V_i + U_i) W + \cdots} \qquad (7)$$

Here $P = P(0)$ denotes the probability of receiving no request in the batch
transmission interval $V_i + U_i$, $L$ is the duration of the video, and $r$
is the stream rate of the video. The optimal patch window is obtained by
differentiating $R$ with respect to $W$ and setting the result to zero, which
gives:



$$W = \frac{-(V_i + U_i) + \sqrt{P (V_i + U_i)^2 + 2 (1 - P)(V_i + U_i) L}}{(V_i + U_i)(1 - P)} \qquad (8)$$



Optimal batch patch uses dynamic cache to serve batch and patch requests in
the interval $V_i + U_i$. The value of b for the prefix cache cannot be too
large; it is generally a few seconds, because a larger value would lengthen
the clients' waiting time. In a proxy cache based on segmented video, the
dynamic cache technique lets the total length of the prefix and the segmented
cache at the proxy serve as the unit of batch transmission and patching. If
$t_2 < V_i + U_i < W$, there are multiple patch streams within the batch of
length $V_i + U_i$; the dynamic cache at the proxy processes the patches
within the duration of the patch streams, and the cost from the server to the
proxy becomes $1/(1 + (V_i + U_i)\lambda_i)$. If $t_2 < W < V_i + U_i$, there
are multiple patch streams within $W$; a dynamic cache of length $W$ at the
proxy processes the patches within the duration of the multiple patch
streams, and the cost from the server to the proxy becomes
$1/(1 + W \lambda_i)$. Both cases reduce the output from the server and
enhance the efficiency of the proxy cache.

3 Conclusions 

We present a segmented proxy cache technique based on the popularity of
video, and extend the length of batch transmission by using dynamic cache.
Furthermore, we put forward several schemes based on video segments, namely
unicast batch transmission, unicast patch, multicast patch, multicast stream
merging, and optimal batch patch, and study the impact of the choice of
transmission scheme on the transmission cost. The evaluation shows that even
with a relatively small proxy cache, a carefully designed transmission scheme
with dynamic cache based on segmented video can produce significant cost
savings. Its performance is superior to that of the optimal prefix cache,
proportional priority cache, and 0-1 cache schemes.

Abstract. Batching is an important technique for delivering video over the
Internet or in VoD systems, and a key method for improving the efficiency of
video multicast. In this paper, we study batch strategies for a streaming
media proxy cache that uses dynamic cache, and propose three batch algorithms
for the proxy cache: window batch, size batch, and efficient batch. These
methods increase the length of the batch, solve the latency problem of
batching in video multicast, improve the byte-hit ratio of the proxy cache
for streaming media, and save network backbone resources. Event-driven
simulations show that these strategies perform better than prefix cache and
segment cache.



1 Introduction 

Batching is an important technique for video multicast and VoD. It serves all
the client requests for a video that arrive within a time interval b with a
single transmission, so that multiple requests within the interval b are
satisfied at once. Although this method saves transmission resources, users
have to wait during the interval b. Moreover, b cannot be made too long when
transmitting video, which limits the use of the batch technique. In this
paper, we use a streaming media proxy cache with dynamic cache to solve the
following problems: lengthening the batch interval, improving the byte-hit
ratio of the proxy cache, and ensuring that video is delivered to the users
without waiting time.



2 The Batch Using Dynamic Cache for Streaming Media 

In a proxy cache based on segmented streaming media [2], we cache the prefix
first to ensure that there is no startup delay. The segmented cache, which
employs a prefetch technique, caches only a part of the video according to the
popularity of the users' requests. When two or more users request the same
video, we save network resources as long as we can serve all of the requests
with one transmission. In the proxy cache, we employ a dynamic, movable time
window to cache video, with FIFO replacement, so that later requesters
receive the video directly from the proxy cache without waiting. In this
paper, we put forward three batch strategies: window batch, size batch, and
efficient batch.



2.1 Window Batch 

If the length of the video is $L_h$ minutes, the simplest batch strategy is to
transmit the video in batches with a fixed window length. Let $W$ be the
batch window for each video clip; then $L_h/W$ determines the number of
multicasts required. When two or more users request a video, the video
requested by the first user is transmitted from the server through the proxy,
and the later users receive the video from the proxy cache. If the average
request rate is $\lambda$, the average number of users requesting the video
within a window is simply $\lambda W$. If two or fewer users request a
certain video during its batch window, the batch stream can be omitted. This
raises a selection problem: the number of user requests is uncertain, since it
is determined by random fluctuations in the request process. When there are
no more than two requests within the batch window, we do not employ the batch
technique.
In the proxy cache for segmented streaming video, the length of video cached
at the proxy is $2^n b$. We set the window length for window batch to
$W = 2^n b$. When a batch takes place within time $W$, we must judge whether
the batch saves cost. Suppose the requests in the batch are
$A_1, A_2, \ldots, A_j$; then the value of $j/W$ should be smaller than the
threshold value of the proxy cache, otherwise we cannot transmit the video in
a batch. In this way, we enhance the efficiency of the proxy cache and
simplify the cache replacement algorithm.



2.2 Size Batch 

Service providers generally want each batch to cover as many users as
possible, because the income of the system is directly related to the average
batch size. Let $C_s$ be the cost of the video stream for a batch, and let
$N'$ be the average number of users within a batch. If $P$ is the revenue from
each video delivery in the batch, then the revenue from the video transmitted
in the batch is $N'P$. To sustain the income for network resources,
$N'P > C_s$ is necessary. Defining $K = C_s/P$, we need $N' > K$; the longer
the batch, the larger the income of the system. Size batch involves one
question, automatic selection. A higher arrival rate results in a higher
number of users per batch, and a low byte-hit ratio results in a low yield
for the system. Before multicast, the yield is ensured by collecting requests
from $M > K$ users in the batch. The time span that the dynamic cache in the
proxy must cover is the time taken to collect $M$ users. The case $M = 2$ is
already useful in practice.
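
The break-even condition for size batch can be made concrete with a small sketch. It computes the smallest integer M that satisfies M > K = C_s / P and, assuming Poisson arrivals with rate λ, the mean time from the first request until the M-th one, (M - 1) / λ. The function names and the integer rounding are illustrative assumptions.

```python
import math


def min_batch_size(stream_cost, revenue_per_user):
    """Smallest integer M with M * revenue_per_user > stream_cost (M > K)."""
    if revenue_per_user <= 0:
        raise ValueError("revenue_per_user must be positive")
    k = stream_cost / revenue_per_user       # K = C_s / P
    return max(1, math.floor(k) + 1)


def mean_collection_time(m, lam):
    """Mean time from the first request until the m-th (Poisson rate lam)."""
    return (m - 1) / lam
```

For instance, with a stream cost of 10 and a revenue of 4 per delivery, K = 2.5, so at least three users must be collected before the multicast pays for itself.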

2.3 Efficient Batch

In the batch process, the less proxy cache capacity is taken up, the larger
the quantity that can be batched; and the higher the byte-hit ratio of the
proxy cache, the higher the efficiency. Suppose a video with popularity
$\lambda$ has length $L_i$, of which a length $V_i$ is held in the proxy
cache. Users make requests at times $A_0, A_1, \ldots, A_n$ in turn, with
$A_0 = 0$. If $(A_k - A_{k-1}) < L_i/(\lambda V_i)$ for $k = 1, 2, \ldots, n$,
we can ensure that batching is beneficial for every video request within the
batch. That is to say, transmitting those requested videos in a batch helps to
save network resources.
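
The gap condition above lends itself to a short sketch that groups a sorted list of request times into batches, keeping a request in the current batch while its gap to the previous request is below L / (λ V). Starting a new batch when the condition fails is an assumption made for illustration, since the text only states the condition, and the function name is hypothetical.

```python
def efficient_batches(arrivals, L, V, lam):
    """Group sorted arrival times into batches with gaps below L / (lam * V)."""
    if V <= 0 or lam <= 0:
        raise ValueError("V and lam must be positive")
    threshold = L / (lam * V)
    batches = []
    for t in arrivals:
        if batches and t - batches[-1][-1] < threshold:
            batches[-1].append(t)            # gap small enough: same batch
        else:
            batches.append([t])              # otherwise start a new batch
    return batches
```

With L = 60, V = 20 and λ = 1 the gap threshold is 3, so requests at times 0, 1, 2, 10, 11 fall into two batches.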

3 Analysis for Batch Strategy 

We analyze the performance of the batch schemes described above. We focus on a
particular video in the system, whose requests arrive as a Poisson process
with rate $\lambda$. For each scheme, we are interested in the following
closely related performance measures. (a) The number of concurrent streams
$S$, with distribution $f_S(s)$ and mean $S'$. The average number of
concurrent streams indicates the average resources required in the system. It
is also related to the bandwidth saved compared with pure VoD: because the
average number of concurrent streams in pure VoD is $\lambda L_h$, the saved
bandwidth can be defined as $\eta = 1 - S'/(\lambda L_h)$. (b) The batch size
$N$, with distribution $f_N(n)$ and mean $N'$. Note that $N'$ is related to
$S'$ by $\lambda L_h = S' N'$: the average number of user requests arriving
within the display time of a video is $\lambda L_h$, which equals $S' N'$. (c)
The delay $D$ of a random user, with distribution $f_D(t)$ and mean $D'$. The
delay $D$ determines the smallest dynamic cache size that the proxy cache
needs for the batch.



3.1 Analysis for Window Batch 

In the fixed window batch strategy, the batch size is Poisson distributed with
mean $N' = \lambda W$. The video is batched exactly once every $W$ seconds,
therefore $S' = L_h/W$ and $\eta = 1 - 1/(\lambda W)$. The delay $D$ is
uniformly distributed, $D \sim U(0, W)$, and $D' = W/2$. In fixed gating, the
probability of $i$ requests ($i \ge 1$) in a batch is
$(\lambda W)^i e^{-\lambda W} / (i!\,(1 - e^{-\lambda W}))$, therefore
$N' = \lambda W/(1 - e^{-\lambda W})$ and

$$S' = \frac{L_h}{W}\left(1 - e^{-\lambda W}\right) \qquad (1)$$

As expected, $S' \to \lambda L_h$ when $\lambda \to 0$, and
$\lim_{\lambda \to \infty} S' = L_h/W$, so that $S' \le L_h/W$. The bandwidth
saved is $\eta = 1 - (1 - e^{-\lambda W})/(\lambda W)$. Given one arrival in
the batch window, its arrival time is uniformly distributed over the window;
therefore the delay density for a certain video is $f_D(t) = 1/W$ for
$0 \le t \le W$, and $D' = W/2$. The cache size is $T = 0$ when the proxy
cache cannot transmit the video in a batch. In automatic selection, the
distribution of the batch size is
$P(N = i) = (\lambda W)^{i-1} e^{-\lambda W} / (i - 1)!$ for $i \ge 1$,
therefore $N' = 1 + \lambda W$ and

$$S' = \frac{\lambda L_h}{1 + \lambda W} \qquad (2)$$

and $\eta = \lambda W/(1 + \lambda W)$. The delay distribution is given by the
following formula:

$$f_D(t) = \frac{1}{1 + \lambda W}\,\delta(t - W) + \frac{\lambda W}{1 + \lambda W} \cdot \frac{1}{W} \qquad (3)$$

for $0 \le t \le W$, where $\delta(t)$ is the usual impulse function with
$\delta(t) = 0$ for $t \ne 0$ and $\int \delta(t)\,dt = 1$. This accounts for
the first user in each batch having a delay of $W$, while the remaining users
in the batch have a delay uniformly distributed between 0 and $W$. Therefore
$D'$ is given by:

$$D' = \frac{\lambda W + 2}{2 (1 + \lambda W)}\, W \qquad (4)$$

The cache size of a batch for the proxy cache is $D'$, and the number of
requests in a batch is $N'$.
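
The closed-form measures for window batch are easy to tabulate. The sketch below evaluates N', S', η and D' for fixed gating and for automatic selection, assuming the expressions N' = λW / (1 - e^{-λW}) and S' = (L_h / W)(1 - e^{-λW}) for fixed gating, and N' = 1 + λW, S' = λ L_h / (1 + λW), D' = (λW + 2) W / (2 (1 + λW)) for automatic selection. The function names are illustrative.

```python
import math


def fixed_gating_metrics(lam, W, L_h):
    """Return (N', S', eta, D') when empty windows trigger no stream."""
    p_nonempty = 1 - math.exp(-lam * W)
    mean_batch = lam * W / p_nonempty        # N'
    mean_streams = L_h / W * p_nonempty      # S'
    saved = 1 - mean_streams / (lam * L_h)   # eta
    mean_delay = W / 2                       # D'
    return mean_batch, mean_streams, saved, mean_delay


def auto_selection_metrics(lam, W, L_h):
    """Return (N', S', eta, D') when the first request opens the window."""
    mean_batch = 1 + lam * W                 # N'
    mean_streams = lam * L_h / mean_batch    # S'
    saved = 1 - mean_streams / (lam * L_h)   # eta
    mean_delay = (lam * W + 2) / (2 * (1 + lam * W)) * W   # D'
    return mean_batch, mean_streams, saved, mean_delay
```

In both cases the product N'S' equals λ L_h, which gives a convenient sanity check on the formulas.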



3.2 Analysis for Size Batch 



The scheme collects $M$ requests for the video in each batch, so $N$ is fixed
and equal to $M$. As a result, $N' = M$, $S' = \lambda L_h/M$, and
$\eta = 1 - 1/M$. Let $W$ represent the batch period, which is a random
variable equal to the sum of $(M - 1)$ exponential variables. Its distribution
is given by:

$$g_W(w) = \frac{\lambda (\lambda w)^{M-2} e^{-\lambda w}}{(M - 2)!} \qquad (5)$$

The users' delay distribution is obtained by conditioning on $W$:

$$f_D(x) = \cdots \qquad (6)$$

$D' = (M - 1)/(2\lambda)$, which is equal to half of the average batch period.
The cache size of a batch is $T = (M - 1)/\lambda$.
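
The two means quoted for size batch can be checked with a small Monte Carlo sketch. It assumes that a batch opens with its first request at time 0, that the remaining M - 1 requests arrive with independent exponential gaps of rate λ, and that every user waits until the M-th request arrives. The function name, the seed, and the number of simulated batches are illustrative choices.

```python
import random


def simulate_size_batch(m, lam, batches=20000, seed=0):
    """Estimate (mean batch period, mean user delay) for size batch."""
    rng = random.Random(seed)
    total_period = 0.0
    total_delay = 0.0
    for _ in range(batches):
        gaps = [rng.expovariate(lam) for _ in range(m - 1)]
        period = sum(gaps)                   # time needed to collect m requests
        total_period += period
        total_delay += period                # first user waits the whole period
        arrival = 0.0
        for gap in gaps:
            arrival += gap
            total_delay += period - arrival  # later users wait the remainder
    return total_period / batches, total_delay / (batches * m)
```

The estimates should lie close to (M - 1) / λ for the batch period and (M - 1) / (2λ) for the delay.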



3.3 Analysis for Efficient Batch 



In the efficient batch scheme, $N = j$, and its distribution is given by:

$$P(N = j) = \cdots \qquad (7)$$

The users' delay distribution is:

$$f_D(x) = \cdots \qquad (8)$$

All the requests are transmitted in this batch scheme. The duration is
smaller than $L_i/(\lambda V_i)$ when $M = 2$. The proxy cache size is
$T = A_n$, and the number of requests in a batch is $S(t) = n$.

4 Performance and Conclusion

We compared these three batch policies with the full video approach, the
variable-sized segment approach, and the prefix scheme in terms of their
impact on byte hit ratio and startup delay, as functions of the cache size and
the popularity of the video.




Window batch is the simplest and most practical strategy for the proxy cache.
Efficient batch has the highest byte hit ratio (Fig. 1). Over the whole range
of cache sizes, the dynamic cache approach, the variable-sized segment-based
strategy, and the prefix strategy all effectively solve the problem of
startup delay.

Efficient batch, size batch, and window batch further improve the byte hit
ratio of the proxy cache and remove the waiting delay inherent in batching for
VoD. Efficient batch has the highest byte hit ratio. Window batch is
convenient to use, has a simple and practical cache replacement algorithm, and
achieves a high byte hit ratio. Together with the segment strategy, these
three batch strategies enhance the efficiency of the proxy cache.

Abstract. Region of Interest (ROI) image coding is one of the most significant
features of JPEG2000 for network applications. In this paper, a new approach
to ROI coding called Up-Down Bitplanes Shift (UDBShift) is presented. This new
method separates all bitplanes into three parts: Important Significant
Bitplanes (ISB), General Significant Bitplanes (GSB), and Least Significant
Bitplanes (LSB). A certain number of bitplanes of each ROI are upshifted to
ISB according to the degree of interest of that ROI. Then, some BG bitplanes
are downshifted to LSB according to the encoding requirement. Finally, the
remaining significant bitplanes of the ROIs and BG, which are kept in GSB, are
not shifted. Simulation results show a significant reduction in transmission
time and enhanced flexibility at the expense of a small increase in
complexity. Additionally, the method supports arbitrarily shaped multiple ROI
coding with different degrees of interest without coding the ROI shapes.



1 Introduction 

ROI functionality is important in applications where certain parts of the
image are of higher importance than others. In such cases, these ROIs need to
be encoded at higher quality than the background (BG). During the transmission
of the image, these regions need to be transmitted first or at a higher
priority, as for example in progressive transmission. The JPEG 2000 standard
in [1] and [2] not only supports ROI coding but also defines two coding
algorithms, the Maxshift (maximum shift) method [3] in Part 1 and the general
scaling-based method in Part 2, along with the syntax of the compressed
codestream. With these methods, a region of interest of the image can have
better quality than the rest at any decoding bit rate.

Although the Maxshift method is efficient, it has three disadvantages [4].
First, it requires decoding of all ROI coefficients before accessing the
bitplanes of the BG, and it uses large shifting values that significantly
increase the total number of bitplanes to encode. Second, it is inflexible in
interactive network browsers. Third, it has difficulty handling multiple ROIs
of arbitrary shape. The general scaling-based method can support multiple ROI
coding, but it needs to code the shape of every ROI, which not only increases
coding complexity but also restricts the ROI shapes.

In this paper, a new method called Up-Down Bitplanes Shift (UDBShift) is
presented. This new method separates all bitplanes into three parts: Important
Significant Bitplanes (ISB), General Significant Bitplanes (GSB), and Least
Significant Bitplanes (LSB). The experimental results show that the UDBShift
method has three primary advantages: (1) it can support arbitrarily shaped
multiple ROI coding with different degrees of interest without coding the ROI
shapes; (2) it enables flexible adjustment of the compression quality of every
ROI and the BG by using appropriate scaling values based on the bit rate; (3)
it can ensure that all ROIs are encoded at higher quality than the BG as
required. The UDBShift method is therefore more efficient and flexible than
the two standard methods in JPEG2000 for network transmission.



2 Description of UDBShift Method 

2.1 UDBShift Method for Single ROI 

The UDBShift method is based on the ROI coding principle that at low bit
rates, the ROI in an image should sustain higher quality than the BG, while at
high bit rates, both ROI and BG can be coded with high quality and the
difference between them is not very noticeable. First, the wavelet transform
is performed, and the transformed coefficients are quantized. Then, all
bitplanes are divided into three parts: Important Significant Bitplanes (ISB),
General Significant Bitplanes (GSB), and Least Significant Bitplanes (LSB). A
certain number of ROI bitplanes, called ISBs, are upshifted, and some BG
bitplanes, called LSBs, are downshifted according to the encoding requirement.
Finally, the remaining significant bitplanes of the ROI and BG, called GSBs,
are not shifted. Fig. 1 compares ROI coding with the UDBShift method and the
PSBShift method [4] for a single ROI of an image.

Fig. 1. Comparison of PSBShift method (top) and UDBShift method (bottom) for single ROI 

Bitplanes that have not been sent in their entirety, in each subband, are
arithmetically encoded again, skipping all coefficients that do not belong to
the ROI. Every bitplane can be encoded using arithmetic coding based on context
modeling. At the decoder, all bits higher than the original MSB are
downshifted until they return to their real value, and all bits lower than the
original LSB are upshifted to their real value. If the ROI needs higher
quality than the BG, a larger scaling value can be used, and the results
approach those of the general scaling-based method.
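
The up-down shifting idea for a single ROI can be illustrated with a simplified sketch on one non-negative integer coefficient magnitude. It is not the authors' codec: the fixed plane count `planes`, the shift amounts `s_up` and `s_down`, and the exact bit layout are assumptions chosen so that the decoder needs only bit positions, not an ROI mask, which mirrors the decoder rule described above.

```python
def encode(value, is_roi, planes, s_up, s_down):
    """Shift the bitplanes of one non-negative coefficient magnitude."""
    if not 0 <= value < (1 << planes):
        raise ValueError("value must fit in the given number of bitplanes")
    if is_roi:
        keep = planes - s_up
        top = value >> keep                  # ISB: s_up most significant planes
        rest = value & ((1 << keep) - 1)     # GSB: remaining ROI planes
        return (top << (planes + s_down)) | (rest << s_down)
    low = value & ((1 << s_down) - 1)        # LSB: s_down least significant planes
    high = (value >> s_down) << s_down       # GSB: remaining BG planes
    return (high << s_down) | low


def decode(word, planes, s_up, s_down):
    """Undo the shifts using bit positions only."""
    above = word >> (planes + s_down)                 # bits above the original MSB
    middle = (word >> s_down) & ((1 << planes) - 1)   # bits at original positions
    below = word & ((1 << s_down) - 1)                # bits below the original LSB
    return (above << (planes - s_up)) | middle | below
```

In this layout any ROI coefficient with a non-zero ISB part encodes to a larger word than every BG coefficient, so a bitplane-by-bitplane coder would reach the important ROI planes first.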

2.2 Multiple ROI Coding Using UDBShift Method

Multiple ROI coding requires the ROIs in an image to be coded with different
quality according to their degrees of interest. The Maxshift method can support
multiple ROI coding, but as the number of ROIs increases, it has to use large
shifting values that significantly increase the total number of bitplanes to
encode. Greatly increasing the scaling value of the wavelet coefficients
significantly reduces compression efficiency, and in the worst case the
scaling of bitplanes may result in bit overflow. In addition, Maxshift scales
the bitplanes of all ROIs by the same scaling value. The general scaling-based
method allows multiple ROIs to be coded with different scaling values, but it
needs to code the ROI shapes, and the current JPEG 2000 standard supports only
rectangular or elliptical ROI shapes. The UDBShift method supports efficient
multiple ROI coding by adjusting the shifting values of the ISBs and LSBs.
Fig. 2 illustrates the method scaling different bitplanes of three ROIs.
First, the method ensures that all ROIs have the same scaling value $S_r$,
and a certain number of bitplanes of each ROI are upshifted to ISB according
to the degree of interest of that ROI (in Fig. 2 the numbers of important
significant bitplanes are chosen as $s_1 = 6$, $s_2 = 5$, $s_3 = 4$). Second,
some BG bitplanes are downshifted to LSB according to the encoding
requirement. Finally, the remaining significant bitplanes of the ROIs and BG,
which are kept in GSB, are not shifted.

Fig. 2. UDBShift method for multiple ROI coding with three ROIs ($s_1 = 6$, $s_2 = 5$, $s_3 = 4$)



3 Experimental Results and Conclusions 

Fig. 3 gives multiple ROI coding results for Lena from low to medium bit
rates. Three ROIs are defined in the image, with priority order
ROI-1 > ROI-2 > ROI-3. The numbers of upshifted ISBs should be chosen such
that $s_{\mathrm{ROI-1}} > s_{\mathrm{ROI-2}} > s_{\mathrm{ROI-3}}$, e.g.,
$s_{\mathrm{ROI-1}} = 6$, $s_{\mathrm{ROI-2}} = 5$, $s_{\mathrm{ROI-3}} = 4$.
At low bit rates (e.g., below 1.0 bpp), all ROIs have higher quality than the
BG, and ROI-1 has the highest quality among the three ROIs. As the bit rate
increases, the BG quality increases quickly, because the numbers of upshifted
ISBs of ROI-2 and ROI-3 are not large. The three ROIs reach lossless quality
first because some of the least significant BG bitplanes are downshifted. The
reconstructed quality (PSNR) of the three ROIs is given in Fig. 4.

The proposed method can support arbitrarily shaped multiple ROI coding with
different degrees of interest without coding the ROI shapes, which is very
important for interactive network transmission and for remote servers handling
large images. We expect this idea to be valuable for future research in ROI
coding and its applications.




Fig. 3. Multiple ROI coding results for Lena from low to medium bit rates




Fig. 4. The reconstructed Lena image with three ROIs using the UDBShift method. ROI-1 is the
face region, ROI-2 is the feather region, and ROI-3 is the hat region: 0.25 bpp (left), 1.0 bpp (right)

Abstract. The growing need for accurate and fast methods of DNA and protein
determination in the post human genome era has generated considerable 
interest in the development of new microfluidic analytical platforms, fabricated 
using methods adapted from the semi-conductor industry. These methods have 
resulted in the development of the Lab-on-a-Chip concept, a technology which 
often involves having a miniaturised biochip (as an analytical device), with 
rather larger instrumentation associated with the control of the associated 
sensors and of fluidics. This talk will explore the development of new Lab-on- 
a-Chip platforms for DNA, protein and cell screening, using microfluidics as a 
packaging technology in order to enable advances in nanoscale science to be 
implemented in a Lab-on-a-Chip format. The talk will also show how system 
on a chip methods can be integrated with Lab-on-a-Chip devices to create 
remote and distributed intelligent sensors, which can be used in a variety of 
diagnostic applications, including for example chemical sensing within the GI 
tract. 



1. International Context of the Work

The invention of the transistor enabled the first radiotelemetry capsules, which 
utilised simple circuits for in vivo telemetric studies of the gastro-intestinal (GI) tract 
[1]. These units could only transmit from a single sensor channel, and were difficult 
to assemble due to the use of discrete components [2]. The measurement parameters 
consisted of either temperature, pH or pressure, and the first attempts at conducting
real time non-invasive physiological measurements suffered from poor reliability, low
sensitivity and short device lifetimes. The first successful pH gut profiles were
achieved in 1972 [3], with subsequent improvements in sensitivity and lifetime [4, 5]. 
Single channel radiotelemetry capsules have since been applied for the detection of 
disease and abnormalities in the GI tract [6-8] where restricted access prevents the 
use of traditional endoscopy [9]. 

Most radiotelemetry capsules utilise laboratory type sensors such as glass pH 
electrodes, resistance thermometers [10] or moving inductive coils as pressure 
transducers [11]. The relatively large size of these sensors limits the functional
complexity of the pill for a given size of capsule. Adapting existing semiconductor
fabrication technologies to sensor development [12-17] has enabled the development 
of highly functional units for data collection, whilst the exploitation of integrated 
circuitry for sensor control, signal conditioning and wireless transmission [18] has 
extended the concept of single channel radiotelemetry to remote distributed sensing 
from microelectronic pills. 



2. Current Activity at the University of Glasgow 

Our current research on sensor integration and onboard data processing has therefore 
focused on the development of microsystems capable of performing simultaneous 
multiparameter physiological analysis [19,20]. The technology has a range of 
applications in the detection of disease and abnormalities in medical research. The 
overall aim has been to deliver enhanced functionality, reduced size and power 
consumption, through system level integration on a common integrated circuit 
platform comprising sensors, analogue and digital signal processing, and signal 
transmission. We have already created a platform which comprises a novel analytical 
microsystem incorporating a four channel microsensor array for real time 
determination of temperature, pH, conductivity and oxygen (work pioneered by 
Professor Cooper). The sensors were fabricated using electron beam and
photolithographic pattern integration, and were controlled by an application specific 
integrated circuit (ASIC), which sampled the data with 10 bit resolution prior to 
communication off chip as a single interleaved data stream. An integrated radio 
transmitter sends the signal to a local receiver (base station), prior to data acquisition 
on a computer (work pioneered by Dr Cumming). We have now presented real time 
wireless data transmission from a model in vitro experimental set-up, for the first 
time. 

The sensors comprised a silicon diode to measure the body core temperature, whilst 
also compensating for temperature induced signal changes in the other sensors; an ion 
selective field effect transistor (ISFET) to measure pH; a pair of direct contact gold
electrodes to measure conductivity; and a three-electrode electrochemical cell, to 
detect the level of dissolved oxygen in solution. All of these measurements will, in 
the future, be used to perform in vivo physiological analysis of the GI tract. For
example, temperature sensors will not only be used to measure changes in the body
core temperature, but may also be used to identify local changes associated with tissue
inflammation and ulcers. Likewise, the pH sensor may be used for the determination 
of the presence of pathological conditions associated with abnormal pH levels, 
particularly those associated with pancreatic disease and hypertension, inflammatory 
bowel disease, the activity of fermenting bacteria, the level of acid excretion, reflux to 
the oesophagus, and the effect of GI specific drugs on target organs. The conductivity 
sensor will be used to monitor the contents of the GI tract by measuring water and 
salt absorption, bile secretion and the breakdown of organic components into charged 
colloids. Finally, the oxygen sensor will measure the oxygen gradient from the 
proximal to the distal GI tract. This will, in future, enable a variety of syndromes to be
investigated including the growth of aerobic bacteria or bacterial infection
concomitant with low oxygen tension, as well as the role of oxygen in the formation 
of radicals causing cellular injury and pathophysiological conditions (inflammation 
and gastric ulceration). The implementation of a generic oxygen sensor will also 
enable the development of first generation enzyme linked amperometric biosensors, 
thus greatly extending the range of future applications to include e.g. glucose and 
lactate sensing, as well as immunosensing protocols. 




[Figure 1 panel labels: Sensor Chip 1, Sensor Chip 2]





Figure 1: (Top left) showing the ISFET, temperature and conductivity sensor (Chip 1, a,c) and 
the electrochemical oxygen sensor (Chip 2, b, d). Figures e and f show detail of the pH and
oxygen sensor, respectively; (Middle) Schematic (top right) and photo (below) of the Glasgow 
IDEAS capsule. Although the capsule is currently too large to swallow, the hybrid approach 
towards its construction provides considerable experimental flexibility. It is estimated that the 
volume of the pill could be readily reduced by ca. 40% through careful layout of the packaging 
and surface mount; (Bottom) a recording of pH and temperature using wireless transmission of 
data from a model gut system, showing the importance of temperature sensing. As expected the 
response of the pH sensor has a Nernstian dependence on temperature (although the reverse is 
not true, and the temperature sensor does not have a pH dependence). The Figure also shows 
that there is no signal cross-talk across the capsule, between sensors, despite the fact that they 
are located proximal to each other on the chip (and share the same microsystem for signal 
collection and transmission). 
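
The Nernstian temperature dependence mentioned in the caption can be illustrated with a
short calculation. The sketch below is not the authors' signal processing; it only
evaluates the ideal Nernst slope, 2.303 RT/F, at a given temperature. The function
names, the one-point calibration (ph_ref, v_ref_mv) and the sign convention (voltage
falling as pH rises) are assumptions made for illustration.

```python
import math

R_GAS = 8.314462618    # molar gas constant, J/(mol K)
FARADAY = 96485.33212  # Faraday constant, C/mol


def nernst_slope_mv_per_ph(temp_c: float) -> float:
    """Ideal Nernstian sensitivity of a pH sensor in mV per pH unit."""
    temp_k = temp_c + 273.15
    return 1000.0 * math.log(10) * R_GAS * temp_k / FARADAY


def compensated_ph(voltage_mv: float, temp_c: float,
                   ph_ref: float = 7.0, v_ref_mv: float = 0.0) -> float:
    """Convert a sensor voltage to pH using the slope at the measured temperature.

    ph_ref and v_ref_mv describe a hypothetical one-point calibration; the sign
    convention (voltage decreases as pH increases) is also an assumption.
    """
    return ph_ref - (voltage_mv - v_ref_mv) / nernst_slope_mv_per_ph(temp_c)


if __name__ == "__main__":
    for t in (25.0, 37.0):
        print(f"{t:.0f} C: {nernst_slope_mv_per_ph(t):.2f} mV/pH")
```

The slope grows by roughly 2.4 mV/pH between 25 °C and 37 °C, which is why an on-chip
temperature reading is useful when interpreting the pH channel.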

Abstract. In this paper, disposable thin film electrode biosensors offering low cost,
high reliability, robustness and low sample volume, together with a hand-held
multichannel meter, were developed. Various biomolecules, such as glucose,
lactate, β-hydroxybutyrate, cholesterol, hemoglobin and creatine kinase, have been
detected in low sample volumes (less than 3 μL). This is significant for applications
in home health care, clinical diagnostics, and the physiological identification and
physical performance of athletes. Biochips based on micro-electro-mechanical-systems
(MEMS) technology supply novel biochemical analytical technologies, which offer many
advantages including high sample throughput, high integration, and reduced cost.
Biochips have developed rapidly in recent years. This paper also presents research
results on biochips based on MEMS technology, including DNA purification chips,
DNA-PCR chips, capillary electrophoresis chips, PCR-CE chips, LAPS (light addressable
potential sensors) for DNA detection, DNA SPR (surface plasmon resonance) sensors and
DNA FET (field effect transistor) sensors. These biochips have potential applications
in health care diagnosis, environmental monitoring, gene sequencing and
high-throughput drug screening.



1. The Research and Development of Biosensors at IECAS

It is known that biosensors will be integrated and miniaturized, and will be used to
replace existing, more time consuming analytical methods for monitoring and detecting
[1, 2]. In this work, thin film electrode biosensors with a 2-electrode construction and
a hand-held meter were developed. The surface of the working electrode of the
biosensor, modified with nanoscale materials (electrodeposited platinum or
carboxymethylcellulose, CMC), is porous and has excellent hydrophilicity, so the
electrode possesses a very large surface area and high catalytic activity for
electrolytic processes.
Various biomolecules, such as glucose [3], lactate [3], β-hydroxybutyrate [4],
cholesterol [4], hemoglobin [7] and creatine kinase [8], were detected in low sample
volumes, as shown in Table 1 [5]. Compared with the strips on the market and existing
analytical instruments, the disposable biosensors produced (Figure 1a) offer low cost,
high reliability, robustness and low sample volume, and the portable multichannel
meters (Figure 1b) can be used to test human metabolites without reagents. This is
significant for applications in home health care, clinical diagnostics, and the
physiological identification and physical performance of athletes. The biosensors also
have potential applications in the food, beverage, environmental, pharmaceutical,
bioprocess and antiterrorism fields.



Table 1. The biosensors characteristics achieved [5]

Test Molecules      Sample    Nanoscale         Test      Measurement     Correlation
                    Solution  Materials         Time      Range           Coefficient

Glucose             Buffer    Pt nanoparticles  12 s      0.5-12 mM       0.998
                    Serum     Pt nanoparticles  27 s      1-30 mM         0.965
Lactate             Buffer    Pt nanoparticles  15 s      0.5-15 mM       0.998
                    Serum     Pt nanoparticles  25 s      0.5-10 mM       0.988
β-hydroxybutyrate   Buffer    Pt nanoparticles  20 s      0.01-4 mM       0.999
                    Serum     Pt nanoparticles  20 s      0.01-6 mM       0.946
Cholesterol         Buffer    Pt nanoparticles  30 s      0.1-5 mM        0.995
Hemoglobin          Buffer    Nanoporous CMC    90 s      10 μM-3 mM      0.995
Creatine kinase     Buffer    Nanoporous CMC    140 s     8-800 U/mL      0.980
                                                280 s     8-800 U/L       0.960















(a) (b) 

Figure 1. Photos of the thin film electrode biosensors fabricated at laboratory
throughput and of the meter: (a) the disposable biosensors, vacuum packed; (b) the
portable multichannel meter.

2. Current Research on Biochips at IECAS

Biochips based on micro-electro-mechanical-systems (MEMS) technology supply
novel biochemical analytical technologies, which offer many advantages including
high sample throughput, high integration, and reduced cost. Biochips have developed
rapidly in recent years [9]. The research results on biochips based on MEMS
technology are described below:

(1) Two types of DNA purification chips based on solid phase extraction (SPE)
    technology have been fabricated and studied. Both chips were used to
    purify DNA from PCR products. The silicon chip was also used to
    purify DNA from yeast bacteria.

(2) DNA-PCR chips were fabricated on glass and silicon substrates using
    MEMS technology (Figure 2a), and a portable temperature controller
    (Figure 2b) for the PCR chip has been developed [10]. The PCR reaction has
    been carried out successfully in this system.

(3) A PDMS electrophoresis microchip was constructed by a molding method. A
    novel method for fabricating a PDMS sandwiched microfluidic chip was
    presented, and the microchip has been demonstrated as a capillary
    electrophoresis device for double-stranded DNA (dsDNA) and amino acid
    separation (Figure 3) [11].

(4) Fundamental research on the integrated chip, including materials compatibility
    and the feasibility of the microfabrication process, has been done. A PDMS
    sandwiched PCR-CE chip and a PCR-CE electrochemical detection chip have been
    designed and microfabricated.

(5) Novel technologies for DNA detection, including the light addressable potential
    sensor (LAPS), SPR sensors and DNA FET, have been developed and good
    results have been obtained.

The above research on biochips based on MEMS has laid a foundation for
potential applications in health care diagnosis, environmental monitoring, gene
sequencing and high-throughput drug screening.

(a) (b)

Figure 3. (a) The photo of PDMS microchip for DNA separation; (b) the separation 
image of DNA fragments labeled by SYBR Green I. [11] 

In this paper, we address some open issues in intelligent sensor network research.
Recent advances in wireless communications and electronics have enabled the
development of low-cost sensor networks, one of the most important technologies for
the 21st century. Sensor networks can be used in various application areas, such as
security and surveillance, smart classrooms, monitoring of natural habitats and
ecosystems, and medical monitoring. Although there have been great improvements in
research on sensor networks, some open issues still need to be solved to make the
whole system work well. The first is the sensor node platform. The key issue is how to
design and implement a node that is cheaper than the Berkeley Motes and the iPaq-based
Sensor Node, two well-known platforms. The second is energy awareness. Energy
efficiency is the crucial problem in sensor networks. There are various sources of
energy consumption in sensor networks, such as the processing unit, radio, sensors and
actuators. Some results show that actuation energy is the highest and communication
energy the next highest. Node-level and network-level techniques are employed in
various energy management methods. The third comprises the time and space issues. Time
synchronization is a critical piece of sensor network infrastructure. Almost all forms
of sensor data fusion and coordinated actuation require synchronized physical time for
reasoning about events in the physical world. The clock accuracy and precision
requirements are often more stringent in sensor networks than in traditional
distributed systems. The space issues comprise the node localization and sensor
coverage problems. Both are essential to support services and applications in sensor
networks. The fourth is the protocols of sensor networks. The protocol stack consists
of the physical layer, data link layer, network layer, transport layer and application
layer. The physical layer addresses the need for simple but robust modulation,
transmission, and receiving techniques. Since the environment is noisy and sensor
nodes can be mobile, the medium access control (MAC) protocol must be power-aware and
able to minimize collisions with neighbors' broadcasts. The network layer takes care
of routing the data supplied by the transport layer. The transport layer helps to
maintain the flow of data if a sensor network application requires it. Depending on
the sensing task, different types of application software can be built and used on the
application layer. The last issue is collaborative signal processing. The nodes in a
sensor network must collaborate to collect and process data to generate useful
information. Important technical issues include the degree of information sharing
between nodes and how nodes fuse the information from other nodes. One also needs to
consider the tradeoff between better system performance and resource limitations in
collaborative signal and information processing. Addressing all of the above will make
the system more intelligent.

Abstract. A self-configuring wireless sensor network (WSN) system will be
presented. This "smart-dust" system, deployed in more than 500 installations
and based on the open-source TinyOS embedded operating system from UC Berkeley,
is the most widely used WSN worldwide. Applications and their
requirements and characteristics will be presented, along with markets and the
latest technology. Key technical issues with real-world deployments will be
explored.

Abstract. Wireless sensor networks have become increasingly popular due to
their variety of applications in both military and civilian fields. Routing algorithms
are critical for enabling the successful operation of sensor networks. A number
of routing algorithms have been proposed. However, all of these routing algorithms
are considered in isolation from the particular communication needs of data
management. This paper focuses on the design of routing algorithms that take into
account the needs of data query processing in sensor networks. A query-aware
routing algorithm is proposed. The algorithm has the following advantages
compared with other routing algorithms. First, it processes as many queries as
possible while routing. Second, broadcasts are executed locally, so that the energy
required by global broadcasts is saved. Third, routing is executed by
searching and generating a binary tree, and only two boundary nodes are selected to
broadcast a message when a broadcast is needed, so that the number of broadcasts is
reduced dramatically and the coverage range of a local broadcast is increased.
Finally, multiple routing paths for many routing requirements are found by merging
routing requirements and performing only one random walk in the sensor network.
Experimental results show that the proposed algorithm has better performance and
scalability than other routing algorithms.



1 Introduction 

Recent advances in micro-electronics and wireless technologies enable the creation of
small, cheap, and smart sensors. In the past few years, smart sensor devices have
matured to the point that it is now possible to deploy large, distributed sensor
networks in an ad-hoc fashion. These sensors monitor various quantities such as
temperature, pressure, humidity, movement, noise level, chemicals, etc. Such
networks pose new challenges in data processing and dissemination because of their
limited resources, such as processing ability, bandwidth and energy. Even though each
single sensor has limited capabilities, a network consisting of a large number of
such sensors is powerful enough to deal with complex monitoring missions. Wireless
sensor networks have become increasingly popular due to their variety of applications
in both military and civilian fields, ranging from battlefield surveillance to natural
habitat monitoring. Sensor networks are attracting more and more attention.

Routing is critical for enabling successful operations of sensor networks.
Traditional routing schemes are not suitable for sensor networks, so many new routing
algorithms have been developed [1, 2, 3, 4, 5].

The directed diffusion routing algorithm proposed in [1] provides a
mechanism for doing a limited flood of a query toward the event and then setting up 
reverse gradients to send data back along the best route. This algorithm employs the 
techniques of the initial low-rate data flooding and gradual reinforcement of better 
paths to accommodate certain levels of network and sink dynamics. In order to find 
the best path, this routing algorithm resorts to flooding the query throughout the en- 
tire network. Directed diffusion results in high quality paths, but requires an initial 
flood of the query for exploration. 

The Geo-Routing algorithms were considered in [9] and [10]. Geo-Routing algo- 
rithms rely on localized nodes, and provide savings over a complete network flood by 
limiting the flooding to a geographical region, but they do not work without the
geographic information of sensor nodes.

[2] proposed a random routing algorithm, Rumor Routing. Rumor Routing is intended
to work in conjunction with diffusion, bringing innovations from GRAB and
GOSSIP routing to this context. In Rumor Routing, each node maintains a
neighbor table and an event table. The event table is generated by an agent. The agent
broadcasts an event to the farther nodes and builds up an event table. A query can use
the information in the neighbor table and the event table to form a routing path. If
the querying node has a path to the sink, then the sink is looked up along this path
directly. Otherwise, one of its neighbor nodes is selected to continue querying. Rumor
Routing requires each node to maintain a neighbor table and an event table, which
consumes too much energy.

[3] described TTDD, a Two-Tier Data Dissemination approach that provides data 
delivery to multiple mobile sinks. Each data source in TTDD proactively builds a grid 
structure which enables the mobile sinks to continuously receive data on the move by 
flooding queries within a local cell only. TTDD handles multiple mobile sinks effi- 
ciently, but it is only suitable for sensor networks without mobile sensor nodes. 

[5] implemented a cluster based routing algorithm. The basic idea is to divide sen- 
sor nodes in a network into some clusters. Each cluster is managed by a head. Rout- 
ing is executed by the head. In this routing method, not only does the maintenance of 
clusters require much cost, but also the head may become a bottleneck of the process- 
ing information and communication. 

[4] provided ACQUIRE, a routing mechanism for obtaining information in sensor
networks, which treats each query as an active entity. This entity searches for results
by forwarding the query (randomly or in some other way) through the network. In
ACQUIRE, a neighbor table is maintained dynamically at each intermediate node. This
table contains the neighbors within d hops of the node. Each active entity can use its
neighbors to generate part of the query result. When the query result has been
completely generated, it is returned to the querying node along the reverse path.
ACQUIRE generates efficient routes by selecting a proper d and refresh frequency c, but
it is not suitable for cases with a high refresh frequency (0.08<c<1).

In summary, all the current routing algorithms, except the one in [4], consider
energy efficiency and the extension of network lifetime in isolation from the
particular communication needs of data management, and lose the opportunity for
cross-layer optimization, that is, to design and adapt routing algorithms to the
particular routing needs of the data management layer. Although [4] considers query
processing while routing, it is not suitable for cases with a high refresh frequency
(0.08<c<1).

This paper focuses on the design of routing algorithms that take account of the
needs of data query processing in sensor networks. A query-aware routing algorithm
is proposed. The algorithm has the following advantages compared with other routing
algorithms. First, it processes as many queries as possible while routing. Second,
broadcasts are executed locally, so that the energy required by global broadcasts is
saved. Third, routing is executed by searching and generating a binary tree, and only
two boundary nodes are selected to broadcast a message when a broadcast is needed, so
that the number of broadcasts is reduced dramatically and the coverage range of a local
broadcast is increased. Finally, multiple routing paths for many routing requirements
are found by merging routing requirements and performing only one random walk in the
sensor network. Experimental results show that the proposed algorithm has higher
performance and scalability than other routing algorithms.



2 Query-Aware Routing Algorithm 



All nodes in the sensor network are assumed to be homogeneous and uniformly
distributed. S is the sink and D is the source. Routing from S to D means finding a
multi-hop path from S to D. In this paper, routing is not only considered as simple
path seeking, but also executes queries along the way. We call this kind of routing
query-aware routing. Query-aware routing is treated as a search on a dynamically
generated binary tree whose nodes are sensor nodes in the sensor network, with S as
the root and the search targets as leaves. Since D is the search target, D must be a
leaf. Routing from S to D is a depth-first (or breadth-first) search procedure. As the
depth increases, the binary tree grows. Once D is found, the routing is successful, D
becomes a leaf node and the multi-hop path is generated. If the predefined search depth
is reached without finding D, the search fails and a failure message is sent to S. We
first discuss some basic concepts and then give the query-aware routing algorithm DFRS
in detail.

Fig. 1 DFRS



Fig. 2 Routing-Tree

A Routing Tree is a binary tree randomly generated for routing. The height of the
tree is denoted as Hop_End in the rest of the paper. The generation procedure of a
routing tree in a sensor network is shown in Fig. 1. An example routing tree is shown
in Fig. 2.

The sink node S is the node that sends the routing request to the source node. It is
the root of the routing tree and the start node of a route.

The target node D is the destination node of a routing, and is a leaf of the routing
tree.

The minimum routing search depth, Hop_End, is determined by the diameter R of
the sensor network and the effective transmission radius r, i.e. Hop_End = R/r. In
actual applications, the search depth is often more than R/r, that is Hop_End =
(1+ε)R/r, where 0<ε<1 is a constant.
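
As a small worked example of the search depth formula, the sketch below computes
Hop_End = (1+ε)R/r and rounds up to a whole number of hops. The rounding, the function
name hop_end, and the choice of the deployment area's diagonal as the diameter R are
assumptions made here for illustration; the paper does not state them.

```python
import math


def hop_end(diameter: float, radius: float, epsilon: float = 0.0) -> int:
    """Search depth (1 + epsilon) * R / r, rounded up to a whole hop count.

    epsilon = 0 gives the minimum depth R / r. Rounding up is an assumption.
    """
    if radius <= 0 or diameter < 0:
        raise ValueError("radius must be positive and diameter non-negative")
    if not 0.0 <= epsilon < 1.0:
        raise ValueError("epsilon must satisfy 0 <= epsilon < 1")
    return math.ceil((1.0 + epsilon) * diameter / radius)


if __name__ == "__main__":
    # Assumed diameter: diagonal of a 1000 x 750 area, with r = 100.
    R = math.hypot(1000, 750)   # 1250.0
    print(hop_end(R, 100))      # minimum depth
    print(hop_end(R, 100, 0.5)) # depth with a 50% safety margin
```

Under these assumptions a 1000 x 750 area with r = 100 gives a minimum depth of 13 hops.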

A boundary node of a node n is a node in the node set B, where B=C-c, C is the set of
nodes in the circle of radius r centered at n, c is the set of nodes in the circle of
radius r-a centered at n, r is the communication radius of n, and a<r is a parameter.
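
The boundary set B = C - c is simply the ring of nodes whose distance from n lies
between r-a and r. A minimal sketch follows, assuming that node positions are known to
the caller (in the protocol itself nodes report their boundary status by message) and
that the outer edge is inclusive while the inner edge is exclusive; the function and
variable names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def boundary_nodes(n: str, positions: Dict[str, Point],
                   r: float, a: float) -> List[str]:
    """Return the nodes in B = C - c for node n.

    C holds nodes within distance r of n, c holds nodes within distance r - a,
    so B is the ring r - a < distance <= r (edge handling is an assumption).
    """
    if not 0 < a < r:
        raise ValueError("the parameter a must satisfy 0 < a < r")
    nx, ny = positions[n]
    ring = []
    for node, (x, y) in positions.items():
        if node == n:
            continue
        d = math.hypot(x - nx, y - ny)
        if r - a < d <= r:
            ring.append(node)
    return sorted(ring)


if __name__ == "__main__":
    pos = {"n": (0, 0), "p": (50, 0), "q": (90, 0), "s": (100, 0), "t": (120, 0)}
    print(boundary_nodes("n", pos, r=100, a=20))
```

A larger a widens the ring and yields more candidate boundary nodes, at the cost of
shorter average hops.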

In sensor networks, nodes exchange and transmit messages with their neighbors. A
message has a message type and a message content. The message types are RRQ, RRRQ,
FR and NT. The content of an RRQ message consists of the sink node id (SID), target
node id (DID), query id (QID), query (Query), and current search depth H. The content
of an RRRQ message consists of the sink node id (SID), query id (QID), and returned
result (Result). The content of an FR message consists of the target node id (DID) and
a Query or Result. An NT message consists of a node id (NID) and node type (NodeType).
If NodeType=boundary, the corresponding node is a boundary node.
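
The four message types can be written down as plain records. The sketch below is one
possible encoding, not the authors' wire format: the field types (integers for ids,
strings for queries and results), the NodeType values other than "boundary", and the
helper is_boundary_notice are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class NodeType(Enum):
    ORDINARY = "ordinary"   # hypothetical name for a non-boundary node
    BOUNDARY = "boundary"


@dataclass
class RRQ:
    sid: int      # sink node id
    did: int      # target node id
    qid: int      # query id
    query: str    # the query to execute
    h: int        # current search depth


@dataclass
class RRRQ:
    sid: int      # sink node id
    qid: int      # query id
    result: str   # returned result


@dataclass
class FR:
    did: int      # target node id
    payload: str  # a Query or a Result


@dataclass
class NT:
    nid: int             # id of the reporting node
    node_type: NodeType  # whether that node is a boundary node


Message = Union[RRQ, RRRQ, FR, NT]


def is_boundary_notice(msg: Message) -> bool:
    """True if msg is an NT message announcing a boundary node."""
    return isinstance(msg, NT) and msg.node_type is NodeType.BOUNDARY
```

Keeping each type as its own record makes the dispatch on message type in DFRS a simple
type check.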

After the sink sends an RRQ request to its neighbors, the nodes that accept the RRQ
message process the message content with the following algorithm, DFRS. H in the
RRQ is computed as (1+ε)R/r.

Algorithm DFRS 

Input: an RRQ, RRRQ, FR or NT message.

Output: the processed RRQ, RRRQ, FR or NT message, sent onward.

MessageProcessing(Message) {  // invoked after the current node m receives a message from node i
    if (HaveMultiMessage())
        merge the messages into a single message;
    while (Message) {
        if (Message.Type == RRRQ) {                 // Message is an RRRQ message
            if (Message.Message.SID == m.ID)        // m is the node that sent the request
                ProcessResult(Message);             // process the query result
            else
                Send(Message, i);                   // transmit the RRRQ message to the parent node
        }
        else if (Message.Type == RRQ) {             // Message is an RRQ message
            if (m.ID == Message.DID) {              // m is the target node
                Result = Process(Message.Query);    // process the query
                CreateMessage(RMessage);            // construct an RRRQ message
                Send(RMessage, i);                  // return the query result to parent i
            }
            if (Message.H > 0 && HaveBoundaryNode(Message.QID)) {
                Message.H--;
                CreateMessage(RMessage);            // construct an FR message carrying the RRQ
                if (m.B(l))
                    Send(RMessage, m.B(l));         // search sub-trees in left-first order
                else if (m.B(r))
                    Send(RMessage, m.B(r));         // search sub-trees in right-first order
                else {                              // select two boundary nodes of m: m.B(l) and m.B(r)
                    SelectBoundaryNode(BoundaryNodeBuffer);
                    Send(RMessage, m.B(l));         // search sub-trees in left-first order
                }
            }
            else if (IsBoundary(m)) {               // m is a boundary node of i
                CreateMessage(RMessage);            // construct an NT message: m is a boundary node
                Send(RMessage, i);                  // inform the parent that m is a boundary node
            }
            if (Message.H == 0) {                   // maximum search depth reached, search right sub-tree
                Message.H++;                        // refresh H of the RRQ
                UpdateMessage(RMessage);
                Send(Message, i);
            }
        }
        else if (Message.Type == FR) {
            Send(Message.Message, 0);               // transmit the RRQ to the neighbors of m
            Timer(TO);                              // wait for results from boundary nodes
        }
        else if (Message.Type == NT && Message.NodeType == boundary)
            Store(Message.NID);                     // store the boundary node of m in BoundaryNodeBuffer
        Message = Message->next;
    }
}

Fig. 1 and Fig. 2 illustrate the routing procedure of DFRS. Starting from the sink S,
the routing tree is traversed depth first. The traversal of the routing tree is also
the procedure that generates the tree. Once the target node is found during the
traversal, the traversal finishes and the generation of the tree is complete.
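
The idea of generating the routing tree while traversing it can be sketched as a
depth-limited depth-first search. The code below is a centralized simulation written
for illustration, not the distributed DFRS protocol: it assumes global knowledge of
node positions, picks up to two unvisited boundary nodes at random as the left and
right children, and treats a target within radio range as reachable by one local
broadcast. All names are hypothetical.

```python
import math
import random
from typing import Dict, List, Optional, Set, Tuple

Point = Tuple[float, float]


def dfs_route(positions: Dict[str, Point], sink: str, target: str,
              r: float, a: float, max_depth: int,
              seed: int = 0) -> Optional[List[str]]:
    """Depth-limited DFS over a binary tree that is generated during the search.

    Returns the multi-hop path from sink to target, or None if the depth limit
    is reached on every branch.
    """
    rng = random.Random(seed)
    visited: Set[str] = {sink}

    def dist(u: str, v: str) -> float:
        (ux, uy), (vx, vy) = positions[u], positions[v]
        return math.hypot(ux - vx, uy - vy)

    def search(node: str, depth: int, path: List[str]) -> Optional[List[str]]:
        if node == target:
            return path
        if dist(node, target) <= r:      # a local broadcast reaches the target
            return path + [target]
        if depth == 0:                   # depth limit reached on this branch
            return None
        ring = [v for v in positions
                if v not in visited and r - a < dist(node, v) <= r]
        rng.shuffle(ring)
        for child in ring[:2]:           # at most two children: a binary tree
            if child in visited:         # may have been used by the left subtree
                continue
            visited.add(child)
            found = search(child, depth - 1, path + [child])
            if found is not None:
                return found
        return None

    return search(sink, max_depth, [sink])


if __name__ == "__main__":
    line = {f"n{i}": (20.0 * i, 0.0) for i in range(11)}
    print(dfs_route(line, "n0", "n10", r=50, a=30, max_depth=4))
```

Forwarding only to boundary nodes makes each hop cover close to the full radio range,
which is what keeps the tree shallow.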

During the traversal of the routing tree, efficiency improves if multiple queries can
be executed concurrently. Based on this idea, the routing requests from other nodes
are merged dynamically while the routing is executed. That is, once the routing
requests are satisfied, they are merged into the routing request being executed. In
the subsequent routing process, once the target node of some routing request is
reached, its query result is returned while the other routing requests continue to be
executed until all the target nodes are found or the specified search depth is reached.

In DFRS, the following strategies are used to reduce the number of broadcasts and
save energy.

(1) Process as many queries as possible while routing, so that the routes are fully
used and the required energy is reduced.

(2) Execute broadcasts locally, so that the energy required by global broadcasts is
saved.

(3) Transmit messages via boundary nodes, so that the coverage range of a local
broadcast is increased and the number of repeatedly received messages is decreased.

(4) Execute routing by searching and generating a binary tree. Only two boundary
nodes are selected to broadcast a message when a broadcast is needed, and thus the
number of broadcasts is reduced dramatically.

(5) Find multiple routing paths for multiple routing requirements by merging the
routing requirements and performing one random walk in the sensor network.



3 Experiments and Analysis

In order to test the proposed algorithm DFRS, a simulation environment for sensor
networks was built. The number of nodes in the sensor network is varied from
1850 to 7400, and all nodes are distributed within an area of x × 750, where x ranges
from 1000 to 4000. For simplicity, the nodes in the network are uniformly deployed in a
grid with size 20 × 20.

The following assumptions are used for the experiments: (1) each message transmitted
during routing fits in one packet; (2) transmitting one packet consumes one unit of
energy; (3) no energy is consumed when a node receives a packet; (4) the effective
communication radius of every node is 100 units of length; (5) the initial energy
of each node is 150 units of energy; (6) the target node and sink node of a routing are
generated randomly.
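
The deployment and energy assumptions can be reproduced in a few lines. In the sketch
below, placing one node at the center of every complete 20 × 20 cell is an inference
that happens to reproduce the reported node counts (1850 and 7400); the paper does not
spell out the placement rule. The names grid_nodes and EnergyLedger are hypothetical.

```python
from typing import List, Tuple


def grid_nodes(width: int, height: int, cell: int = 20) -> List[Tuple[float, float]]:
    """One node at the center of each complete cell x cell square (assumed layout)."""
    cols, rows = width // cell, height // cell
    return [(cell * i + cell / 2, cell * j + cell / 2)
            for i in range(cols) for j in range(rows)]


class EnergyLedger:
    """Energy model from the assumptions: sending costs 1 unit, receiving is free."""

    def __init__(self, node_count: int, initial: int = 150) -> None:
        self.energy = [initial] * node_count

    def transmit(self, node: int) -> bool:
        """Charge one unit for a packet; return False if the node is disabled."""
        if self.energy[node] <= 0:
            return False
        self.energy[node] -= 1
        return True

    def disabled_ratio(self) -> float:
        """Fraction of nodes whose energy is exhausted."""
        return sum(e <= 0 for e in self.energy) / len(self.energy)


if __name__ == "__main__":
    for x in (1000, 4000):
        print(x, len(grid_nodes(x, 750)))
```

With 150 units per node and one unit per packet, a node is disabled after sending 150
packets, which is the quantity tracked by the node failure ratio.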

Fig. 3 Routing Success Ratio, N=1850

Fig. 4 Node Failure Ratio, N=1850

Fig. 5 Routing Success Ratio, the number of routings is 1000

Fig. 6 Average Dissipated Energy, the number of routings is 1000



The first experiment investigates the success ratios of DFRS and ACQUIRE in 
a simulated sensor network with N=1850 sensor nodes uniformly distributed in an 
area of 1000x750. Fig. 3 shows the success ratios of DFRS and ACQUIRE as the search 
depth varies from 13 to 150, without any failed nodes. It can be seen that DFRS 
has a higher success ratio than ACQUIRE, and that the advantage of DFRS is largest 
when the search depth is small. 

The second experiment investigates the ratio of disabled nodes during routing 
in a simulated sensor network with N=1850 sensor nodes uniformly distributed 
in an area of 1000x750. Fig. 4 shows the ratios of disabled nodes produced by 
DFRS and ACQUIRE as the number of routings increases, at the same success 
ratio. Fig. 4 illustrates that the ratio of disabled nodes caused by ACQUIRE is higher 
than that caused by DFRS, and thus DFRS gives the sensor network a longer 
lifetime. 

The third experiment investigates the scalability of DFRS, in terms of success ratio 
and energy consumption, in a simulated sensor network in which the number of sensor 
nodes varies from 1850 to 7400 and the deployment area varies correspondingly from 
1000x750 to 4000x750. Fig. 5 shows the success ratio of DFRS as the number of sensor 
nodes varies from 1850 to 7400. The success ratio of DFRS stays at about 90% as 
the network size varies. Fig. 6 shows that the average energy consumption of DFRS 
increases linearly with the size of the network. These experimental results indicate 
that DFRS scales well. 





Fig. 7 Average Dissipated Energy, N=1000 



Fig. 8 Average Dissipated Energy, the number of routings is 100 



The fourth experiment investigates the average energy consumption of each 
routing in a simulated sensor network with N=1850 sensor nodes uniformly distributed 
in an area of 1000x750. Fig. 7 shows the average energy consumption per 
routing for DFRS and ACQUIRE as the number of routing requirements 
increases. Fig. 8 shows the average energy consumption per routing for 
DFRS and ACQUIRE as the network size increases. These experimental results 
illustrate that the average energy consumption per routing of DFRS is 
much lower than that of ACQUIRE. 



4 Conclusion 

A query-aware routing algorithm, DFRS, is presented in this paper. DFRS considers not 
only energy efficiency and network lifetime but also 
the particular communication needs of query processing in sensor networks. The 
experimental results show that DFRS outperforms ACQUIRE in success ratio, node 
failure ratio, and energy consumption, and that it scales well. 

Abstract. This paper describes the optimization of a sensor network by a novel 
Genetic Algorithm (GA) that we call King Mutation C2. For a given distribu- 
tion of sensors, the goal of the system is to determine the optimal combination 
of sensors that can detect and/or locate the objects. An optimal combination is 
the one that minimizes the power consumption of the entire sensor network and 
gives the best accuracy of location of desired objects. The system constructs a 
GA with the appropriate internal structure for the optimization problem at hand, 
and King Mutation C2 finds the quasi-optimal combination of sensors that can 
detect and/or locate the objects. The study is performed for the sensor network 
optimization problem with five objects to detect/track and the results obtained 
by a canonical GA and King Mutation C2 are compared. 



1 Introduction 

During the last four decades there has been a growing interest in algorithms that rely 
on analogies to natural phenomena. One type of such algorithms is the Genetic Algo- 
rithms (GAs) that imitate the principles of natural evolution [9, 7]. GA has been 
widely used for combinatorial optimization, structural design, scheduling and other 
engineering problems [8, 13]. 

In this paper we approach the problem of optimizing a sensor network 
with Genetic Algorithms from a practical standpoint: we are interested in obtaining 
quasi-optimal solutions quickly. The sensor network is comprised of randomly distributed 
unattended ground sensors that are remotely deployed; their locations are known after 
deployment. Objects in a space are monitored by a limited number of these low-cost, 
low-power sensors. The advantages of using several such sensors outweigh 
the expected performance degradation, since a system of several inexpensive sensors 
in the same area offers a redundancy that provides acceptable performance. The complete 
system consists of modules that perform self-organization, object tracking, track 
fusion, ID fusion, communication, etc. [4]. This paper focuses on the optimization of 
sensor selection performed by genetic algorithms in the Self-Organization module. 

2 Optimization of a Sensor Network 

We are performing optimization of a sensor network. The network is comprised of 
remotely deployed unattended ground sensors that can be used for object detection, 
tracking and identification. A sensor can be used for tracking an object if this object 
resides in the sensor’s field-of-view (FOV) and if the sensor is turned on. The sensor 
network adapts its structure in order to achieve the goals specified by a human. Sensor 
selection is often performed in order to minimize the power consumption of the 
sensor network, by choosing the sensors that need to be turned on or off at a given 
moment in time. 

The goal of optimization is to find sensors for tracking all the objects identified in 
the network objective (which can be seen as the optimization goal) in a way that optimizes 
certain metrics. In the case of object tracking, two metrics should be optimized: the 
accuracy of object tracking and the power utilization of the sensor network. This 
multi-objective optimization is performed by Genetic Algorithms. For each object 
identified in the network objective, optimization has to find the m sensors needed for 
accurate tracking of the object. The value of m depends on the physical characteristics of the 
sensors used. 

Our problem falls in the category of combinatorial optimization problems: the system 
has to choose tuples of sensors that need to be on. One tuple is needed 
per object, and the same sensor can be used for multiple objects as long as these 
objects are within its FOV. If we have k objects and we need m sensors per object, 1 to k 
m-tuples are needed. The size of the search space is described by: 

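$$\mathrm{SearchSpace} = \prod_{i=1}^{k} \left(\mathrm{NumberOfMTuplesForIthObject}\right) = \prod_{i=1}^{k} \binom{n_i}{m} = \prod_{i=1}^{k} \frac{n_i!}{m!\,(n_i - m)!} \qquad (1)$$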

where $n_i$ is the number of sensors that can detect object i. The search space is 
exponentially increasing with the number of sensors and objects, discontinuous, with 
non-ordered (feature type) parameters. 
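
To give a feel for how quickly this search space grows, the following sketch evaluates its size under the assumption that the number of m-tuples for object i is the binomial coefficient C(n_i, m), i.e. unordered selections of m sensors. The function name and the sample numbers are illustrative, not taken from the experiments.

```python
from math import comb, prod


def search_space_size(sensors_per_object, m):
    """Product over objects of C(n_i, m).

    sensors_per_object: list of n_i, the number of sensors that can detect
    object i. Assumes each m-tuple is an unordered choice of m sensors.
    """
    return prod(comb(n_i, m) for n_i in sensors_per_object)


# Hypothetical case: 5 objects, 20 candidate sensors each, triplets (m = 3).
print(search_space_size([20] * 5, 3))
```

With these hypothetical numbers each object contributes C(20, 3) = 1140 triplets, so the product is roughly 1.9 × 10^15; adding one more object multiplies it by another factor of 1140.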



3 Internal GA Structure for Sensor Network Optimization 

In our design each individual of the Genetic Algorithm population is comprised of 
several genes. Each of the genes contains one sensor’s identification. All the sensors 
that are chosen by the GA to be active at a given moment have their identification 
coded in the genes. There is a unique identification associated with each sensor, and 
the genes use a binary encoding for identification. 

The GA’s internal structure (i.e. number of genes) depends on the Network Objective. 
Whenever this objective changes, the number of genes of the GA also changes. The 
Network Objective includes a list of suspected objects and the required operations 
associated with them. If the operation is to locate the object, there are as many genes as 
necessary for location; for example, in the case of acoustic bearing sensors this number 
is three (Fig. 1). 

[Figure label: Sensor redundancy clusters] 




Fig. 1. Internal structure of GA for location 

When performing object tracking we encounter a multi-objective optimization 
problem. The fitness function of GA takes into account both objectives: maximiza- 
tion of the location accuracy (i.e. minimization of the position tracking error) and 
minimization of the network power consumption. The fitness function has the fol- 
lowing form: 

$$\mathrm{Fitness} = -\left( w_1 \sum_{i=1}^{k} E_i + w_2 \sum_{j=1}^{l} P_j + \sum_{i=1}^{k} \mathrm{PenaltyForEach}\,E_i\,\mathrm{ExceedingThreshold} \right) \qquad (2)$$

where $E_i$ ($i = 1, 2, \ldots, k$) is the estimated position error for the i-th object, $P_j$ 
($j = 1, 2, \ldots, l$) is the power consumption of the j-th sensor, k is the number of objects, l is 
the total number of selected sensors, and $w_1$ and $w_2$ are weights. The last term is a 
penalty added for each position error exceeding a predefined threshold. This 
penalty significantly increases the range of population fitness and thus improves GA 
convergence, but solutions that incur the penalty are still valid. For estimating the 
position errors ($E_i$), we use the GDOP error [6]. The smaller the GDOP error of 
a sensor triplet, the better the position accuracy achieved for the object. 
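
The structure of this fitness function can be sketched as follows. This is a minimal illustration, not the authors' implementation: the constant per-violation `penalty`, the `threshold` value, and the default weights are assumptions, and the position errors are passed in directly rather than computed from GDOP.

```python
def fitness(position_errors, sensor_powers, w1=1.0, w2=1.0,
            threshold=10.0, penalty=100.0):
    """Negated weighted cost: tracking error + power + threshold penalties.

    position_errors: one estimated error E_i per object.
    sensor_powers:   one power consumption P_j per selected sensor.
    penalty:         assumed constant added for each E_i above threshold.
    """
    error_term = w1 * sum(position_errors)
    power_term = w2 * sum(sensor_powers)
    penalty_term = sum(penalty for e in position_errors if e > threshold)
    return -(error_term + power_term + penalty_term)


# Two objects (one error above the threshold) tracked with three sensors.
print(fitness([2.0, 12.0], [1.0, 1.0, 1.0]))  # -(14 + 3 + 100) = -117.0
```

Because the value is negated, a larger fitness corresponds to lower error and lower power, and a single threshold violation dominates the other terms without making the solution invalid.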



4 Genetic Algorithm with Special Reproduction Operators 

The difficulties inherent in GA design are determining the stopping criterion, the 
proper GA population size, and the probabilities of crossover and mutation. The difficulty in 
determining the stopping criterion comes from the fact that GA convergence is 
problem dependent [6,8,15,17]. Wolpert et al. [15] presented a number of "no free lunch" 
(NFL) theorems and established that for any algorithm, any elevated performance 
over one class of problems is exactly paid for in performance over another class. Our 
goal is to obtain a quasi-optimal solution in the shortest possible time for the sensor 
network optimization problem. We make no claims in this paper about the generality of 
the GA developed or its speed of convergence on other problems. 

4.1 Genetic Algorithm with King Strategy 

The King Genetic Algorithm that we developed has been inspired by the reproduction 
process of the bees. There are three kinds of bees: the queen, worker bees, and 
drones. If mated with drones, the queen’s eggs will become worker bees, otherwise 
they will become drones. In bees’ colonies the queen plays the most important role in 
generating the offspring: only she can lay eggs. Inspired by this phenomenon, a novel 
GA that we call King GA, was proposed [14]. In King GA, a special individual, the 
best individual in the population, is always selected in the reproduction process to be 
one of the parents. This reproduction process is shown in Fig. 2a. 




Fig. 2. The reproduction in a) King GA; b) King Mutation. 



4.2 Special Mutation Operator 

In Genetic Algorithms, mutation was first introduced as an auxiliary operator to en- 
sure population diversity. Many papers [1,6] pointed out the importance of mutation, 
but the mutation methods proposed were very similar, the difference merely being the 
value of the mutation rate, or whether the rate was constant or adaptive. Our previous 
experiments with GAs [5] showed that when mutation performs a strong enough 
search, crossover is not necessary for finding the optimum of multi-modal functions 
with non-ordered parameters. Therefore we proposed King Mutation, a version of 
King GA in which only mutation takes place. The reproduction process of King 
Mutation is shown in Figure 2b. 

Mutation in GAs is the process by which one or more genes in an individual are 
modified. Generally, each gene is chosen for mutation with a mutation probability 
$P_m$ that is determined in the initialization step of the genetic algorithm. In the new 
mutation operator that we are proposing, called Mutation C2, exactly two chromosomes 
of an individual are randomly selected to be mutated. Any number of genes in 
a chromosome may undergo mutation: each gene in a chromosome selected for mutation is 
mutated with probability $P_m$. The King Mutation algorithm that uses only mutation of type 
Mutation C2 is called King Mutation C2. 
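
One possible reading of the Mutation C2 operator is sketched below. It assumes that an individual is a list of chromosomes, that each chromosome is the binary-encoded identification of one sensor, and that a gene is a single bit; the function name `mutation_c2` and this data layout are illustrative choices, not the authors' code.

```python
import random


def mutation_c2(individual, p_m, rng=random):
    """Return a mutated copy of `individual`.

    Exactly two chromosomes are picked at random; each bit of a picked
    chromosome is flipped independently with probability p_m.
    Requires an individual with at least two chromosomes.
    """
    child = [list(chrom) for chrom in individual]  # leave the parent intact
    for idx in rng.sample(range(len(child)), 2):
        child[idx] = [bit ^ 1 if rng.random() < p_m else bit
                      for bit in child[idx]]
    return child


rng = random.Random(0)
parent = [[0] * 7 for _ in range(5)]  # 5 chromosomes of 7 bits each
child = mutation_c2(parent, p_m=1 / 7, rng=rng)
print([i for i in range(5) if child[i] != parent[i]])  # at most two indices
```

With a mutation probability of 1/ChromosomeLength, each selected chromosome changes by one bit on average, so a selected chromosome may also come out unchanged.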

King GA with Mutation C2 is similar to Evolutionary Strategies [16], ES(1+λ). In the 
ES(1+λ) algorithm, there is only one parent, which is the best individual in the 
population; the parent generates λ children; in the next reproduction process, the best 
individual from the parent and its λ children is selected as the new parent to generate 
children. So King GA with Mutation C2 is very similar to ES(1+λ), but the mutation 
method is quite different. In ES(1+λ), the main reproduction operator is Gaussian 
mutation, in which a random value from a Gaussian distribution is added to each 
element of an individual’s vector to create a new offspring. 

There are some GA studies [2,10,12] that are similar to the GA we propose 
here. Jones’ Crossover Hill-climbing algorithm proposed in [10] is similar to King 
GA. He compared several algorithms such as Standard GA, Bit-flipping Hill-climbing, 
and Crossover Hill-climbing. The Crossover Hill-climbing algorithm with only 
one step (CH-1S) obtained the best result. Both CH-1S and King Mutation C2 have 
no crossover; they both employ only mutation operators, and their mutations are quite 
different from the traditional mutation method. Another similarity is that in both 
algorithms, the best individual in the population is used for generating offspring. 
However, the mutations performed in King GA with Mutation C2 and in CH-1S are 
dissimilar; another difference is the population size: CH-1S has a population of only 2 
individuals, whereas King GA has a larger population. 



5 Experiment Descriptions 

In an attempt to examine the quality of the GA proposed, we performed a set of ex- 
periments that compared the performance of King Mutation C2 and canonical GA on 
optimization of a sensor network for five objects. In the experiments performed we 
used an area of 25 by 25 kilometers with 81 sensors uniformly distributed. Each 
sensor’s FOV is a circle with a radius of 5 kilometers and there are about 20 sensors 
that can detect each object. For each of the experiments performed the Percentage of 
Total Search Space (PTSS) covered by GA was computed using the following equa- 
tion: 



Percentage Total Search Space = 100 * FFE / SS_n % (3) 

where SS_n is the whole search space for n objects and the number of sensors identified 
above, and FFE is the actual number of fitness function evaluations performed by 
the GA. 

Effectiveness is used to compare the performance of different GAs. For each set of 
the experiments performed with the same values of n and P the Effectiveness was 
computed as: 

Effectiveness = Number of Optimal Runs / Total Number of Runs (4) 
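
Both measures are simple ratios; the sketch below states them in code. The function names and the sample values are illustrative and are not taken from Table 1.

```python
def ptss(fitness_evaluations, search_space):
    """Percentage of the total search space visited (Eq. 3)."""
    return 100.0 * fitness_evaluations / search_space


def effectiveness(optimal_runs, total_runs):
    """Fraction of runs that reached the optimum (Eq. 4)."""
    return optimal_runs / total_runs


# Hypothetical values: 50 evaluations in a space of 1000 points,
# and 19 optimal runs out of 20.
print(ptss(50, 1000))         # 5.0
print(effectiveness(19, 20))  # 0.95
```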

The experiments with different population sizes are listed in Table 1. For the canonical 
GA, the crossover rate is set to 0.9 and the mutation rate is set to 1/IndividualLength; 
for King Mutation C2, the crossover rate is 0 and the mutation rate is set to 
1/ChromosomeLength. We performed experiments with population sizes of 5, 10, 20, 
50, and 100. For each population size, the canonical GA and King Mutation C2 were each run 
30 times, and the results in Table 1 are the averages of those runs. Both methods have 
the same stopping criterion: the algorithm stops iterating if there is no improvement in 
the fitness function after a certain number of consecutive generations (5000 in our 
experiments). 
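
The overall King Mutation loop with this stagnation-based stopping criterion might look like the sketch below. Several details are assumptions made for illustration: the king is kept alongside its children, the population size stays constant, and the toy problem (maximising the number of 1 bits) and helper `flip_one_bit` stand in for the sensor-selection fitness and for Mutation C2.

```python
import random


def king_mutation(fitness, mutate, init_population, patience=5000, rng=random):
    """Run until the best fitness has not improved for `patience` generations.

    Only the best individual (the king) reproduces, and only by mutation.
    Returns the final king and the number of generations executed.
    """
    population = list(init_population)
    king = max(population, key=fitness)
    stale = 0
    generations = 0
    while stale < patience:
        # Assumption: the king survives and fills the rest with its children.
        population = [king] + [mutate(king, rng)
                               for _ in range(len(population) - 1)]
        best = max(population, key=fitness)
        if fitness(best) > fitness(king):
            king, stale = best, 0
        else:
            stale += 1
        generations += 1
    return king, generations


def flip_one_bit(bits, rng):
    """Toy mutation: flip one randomly chosen bit."""
    i = rng.randrange(len(bits))
    return bits[:i] + [bits[i] ^ 1] + bits[i + 1:]


rng = random.Random(0)
start = [[0] * 8 for _ in range(5)]
best, gens = king_mutation(sum, flip_one_bit, start, patience=200, rng=rng)
print(best, gens)
```

The stagnation counter means the run always executes at least `patience` generations after the last improvement, which is why the generation counts reported for both methods exceed 5000.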



Table 1: Experiment results for 5 objects 

Population   GA Method          Generation#   Fitness    PTSS       Effectiveness 
P=5          King Mutation C2   5505          -551.35    5.76E-12   0.80 
             GA                 8565          -889.86    8.95E-12   0.00 
P=10         King Mutation C2   6147          -530.54    1.29E-11   0.95 
             GA                 10151         -672.98    2.12E-11   0.00 
P=20         King Mutation C2   5857          -529.34    2.45E-11   1.00 
             GA                 11453         -642.60    4.79E-11   0.15 
P=50         King Mutation C2   5345          -529.34    5.59E-11   1.00 
             GA                 11228         -562.07    1.17E-10   0.20 
P=100        King Mutation C2   5281          -529.34    1.10E-10   1.00 
             GA                 10142         -548.54    2.12E-10   0.20 

P: Population size; 
Generation#: Number of generations. 



Canonical GA results are poor for small population sizes. With increasing 
population size, the fitness achieved by the canonical GA becomes closer to the optimum. 
The best effectiveness achieved by the canonical GA is for the largest population (100 
individuals) and is only 0.2, meaning that it is very difficult for the canonical GA to 
perform optimization for a sensor network with five objects. 

The results of King Mutation C2 are far superior to those of the canonical GA: it can 
obtain quasi-optimal solutions with high probability, the effectiveness being 0.8 and 
0.95 for populations of size 5 and 10 respectively. Its effectiveness becomes 1 for 
population sizes of 20 or larger. 

Consistently, for each population size, King Mutation C2 gave a better result than 
the canonical GA: a much higher effectiveness and a higher fitness value. King 
Mutation C2 also covered roughly half as much of the search space (PTSS) as the 
canonical GA in each case. A small PTSS is very important in real-world applications, 
since it reduces the computation time and thus allows a real-time 
application of the algorithm. 

6 Conclusion 

This paper describes a system performing self-organization of a sensor network. The 
goal of the system is to choose sensors necessary to perform object detection or track- 
ing while minimizing the power consumption of the entire network. In this paper, 
special emphasis is placed on the optimization performed by genetic algorithms. 

The exponential growth of the search space (with the increasing number of sensors 
and objects) makes the problem intractable for most optimization techniques in a 
reasonable time frame. Genetic Algorithms are chosen for the task, given their high 
robustness in complex search spaces. In the case of multi-objective optimization 
problems, such as object tracking, convergence is much more difficult to achieve. As 
the number of objects increases, the effectiveness of canonical GAs decreases rapidly. 
The growth of the GA search space makes the genetic search of the standard genetic 
algorithm inefficient, and consequently the computation time needed for convergence 
becomes very large. This makes it necessary to improve the canonical genetic 
algorithm to speed up its convergence when the number of objects increases. 

We proposed a novel Genetic Algorithm with a King selection strategy that somewhat 
imitates the reproduction process of bees. The new King selection strategy, 
especially when coupled with a new mutation operator (Mutation C2), significantly 
improves the performance of the GA for the optimization of a sensor network. The new 
algorithm is very robust, giving good results for a wide range of population sizes. 
This is in contrast with traditional GAs, where it is very difficult to set the 
population size and the crossover and mutation rates. 

Abstract. Online mining in large sensor networks has only begun to attract interest. 
Finding patterns in such an environment is both compelling and challenging. 
The goal of this position paper is to understand the challenges and to identify 
the research problems in online mining for sensor networks. As an initial step, 
we identify the following three problems to work on: (1) sensor data irregulari- 
ties detection; (2) sensor data clustering; and (3) sensory attribute correlations 
discovery. We also outline our preliminary proposal of solutions to these prob- 
lems. 



1 Introduction 

Recent technology advances have enabled the development of small, battery-powered, 
wireless sensor nodes [2] [6] [20]. These tiny sensor nodes, equipped with sensing, 
computation, and communication capabilities, can be deployed in large numbers in 
wide geographical areas to monitor, detect and report time-critical events. Conse- 
quently, wireless networks consisting of such sensors create exciting opportunities for 
large-scale, data-intensive measurement and surveillance applications. In many of 
these applications, it is essential to mine the sensor readings for patterns in real time 
in order to make intelligent decisions promptly. In this paper, we study the challenges, 
problems, and possible solutions of online mining for sensor networks. 

Research on data mining has been fruitful; however, online mining for sensor net- 
works faces several new challenges. First, sensors have serious resource constraints 
including battery lifetime, communication bandwidth, CPU capacity and storage [15]. 
Second, sensor node mobility increases the complexity of sensor data because a sen- 
sor may be in a different neighborhood at any point of time [7][19]. Third, sensor data 
come in time-ordered streams over networks. These challenges make traditional min- 
ing techniques inapplicable, because traditionally mining is centralized, computation- 
ally expensive, and focused on disk-resident transactional data. 

In response to these challenges, we propose to develop mining techniques that are 
specifically geared to sensor network environments. Our goal is to process as much 
data as possible in a decentralized fashion while keeping the communication, storage 
and computation costs low. 

As a starting point, we propose three operations of online mining in sensor networks: 
(1) detection of sensor data irregularities, (2) clustering of sensor data, and (3) discovery 
of sensory attribute correlations. These mining operations are useful for practical 
applications as well as for network management, because the patterns found can 
be used for both decision making in applications and system performance tuning. For 
example, irregularities in sensory data are of interest to monitoring applications. In 
addition, for this kind of application, the communication cost can be reduced if only 
abnormal sensory values, as opposed to all values, need to be transmitted. 

The rest of the paper is organized as follows. In Section 2, we illustrate the need 
for online mining in sensor networks using an example. Our design of online mining 
in sensor networks is presented in Section 3. We present related work in Section 4 and 
conclude in Section 5. 



2 A Motivating Example 

There have been initial applications of sensor networks to wildlife habitat surveillance 
[15], battlefield troop coordination, and traffic monitoring. As a motivating 
example for online mining in sensor networks, we describe a possible application to 
wild giant panda monitoring and protection in China. Suppose weather sensors are 
deployed throughout a panda habitat and wearable sensors are attached to the pandas in 
the habitat. The sensors acquire data on attributes such as temperature, light, 
sound, humidity, and acceleration. In addition, there is a panda of interest named 
Huanhuan. The following are a few questions that scientists on site may ask: 

1) Is Huanhuan having any abnormal symptoms compared with its past data? What 
other pandas are having abnormal symptoms and what are these abnormal symptoms? 

2) What pandas have a similar physical status to Huanhuan’s? What pandas are 
similar to one another and on what sensory attributes are they similar? 

3) What attributes of Huanhuan’s are correlated and how are they correlated? What 
attributes of pandas are correlated with the humidity of their habitat? What symptoms 
of pandas are correlated to what attributes of the habitat? 

Answers to these questions are important for habitat maintenance and panda pro- 
tection, and these questions all require online mining of sensor data. 



3 Design of Online Mining 

In this section, we identify the following three problems of online mining in sensor 
networks and outline our preliminary solutions. 

3.1 Detection of Sensor Data Irregularities 

The problem of irregularities detection is to find those sensory values that deviate 
significantly from the norm. This problem is especially important in the sensor net- 
work setting because it can be used to identify abnormal or interesting events or faulty 
sensors. 

We break this problem into two smaller problems. One is to detect irregular pat- 
terns of multiple sensory attributes and the other to detect irregular sensory data of a 
single attribute with respect to time or space. The irregular multi-attribute pattern de- 
tection problem has the assumption that there are some normal patterns among multi- 
ple sensory attributes, which is true in some natural phenomena. Once these normal 
patterns are broken somewhere, the irregularity is detected and reported. In contrast, 
the irregular single-attribute sensor data detection problem examines the temporal and 
spatial characteristics of a sensor node and detects any irregularity in comparison with 
the node’s previous data or the data of the neighbor nodes. 

3.1.1 Detection of Irregular Patterns 

We propose a new approach named pattern variation discovery to solve this problem. 
Our approach works in the following four steps: i) Selection of a reference frame. 
This frame consists of the directions along which we want to look for irregularities 
among multiple sensory attributes. An analyst can explicitly specify the reference 
frame. It is also possible to discover the reference frame that results in a lot of 
irregularities. ii) Definition of normal patterns. This definition can be models of multiple 
sensory attributes or constraints among multiple attributes. iii) Incremental 
maintenance of the normal patterns. Whenever a sensor gets a new round of readings, the 
normal patterns are adjusted incrementally. iv) Discovery of irregularity. Whenever a 
normal pattern is broken at some point along the reference frame, an irregularity 
appears. That is, the pattern variation happens. 

For example, suppose we want to discover irregular distribution patterns among multiple 
sensory attributes over time. Then, for each time point, we can put the values of a 
group of sensory attributes at a series of sensor nodes into a matrix, which represents 
a distribution status. The problem then becomes one of discovering the irregular matrices 
among a set of matrices. An irregular matrix indicates that, at the corresponding time 
point, the distribution pattern of the sensory attributes over the nodes is irregular. 
Because our approach involves many comparisons between matrices, we propose 
to use the technique of Singular Value Decomposition (SVD) [4]. SVD is a 
powerful data reduction and approximation technique, which extracts the useful features 
of a matrix. Using SVD, we can get a vector of singular values out of a matrix. 
Consequently, matrix comparison becomes vector comparison, which is computationally 
less expensive and reduces communication cost. Additionally, by integrating SVD 
with a sliding window mechanism, we can handle streaming sensory data. 
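
The idea of replacing matrix comparison by comparison of singular-value vectors can be sketched with NumPy as follows. The rule used here, flagging a snapshot whose singular-value vector lies far from the mean vector, is an assumption chosen for illustration, as are the function names, the sample readings, and the threshold.

```python
import numpy as np


def singular_signature(matrix):
    """Singular values of a nodes x attributes snapshot, in descending order."""
    return np.linalg.svd(np.asarray(matrix, dtype=float), compute_uv=False)


def irregular_snapshots(snapshots, threshold):
    """Indices of snapshots whose signature is far from the mean signature."""
    sigs = np.array([singular_signature(m) for m in snapshots])
    distances = np.linalg.norm(sigs - sigs.mean(axis=0), axis=1)
    return [i for i, d in enumerate(distances) if d > threshold]


# Rows are nodes, columns are (temperature, humidity); values are made up.
normal = [[20.0, 50.0], [21.0, 52.0], [19.0, 49.0]]
odd = [[20.0, 50.0], [45.0, 10.0], [19.0, 49.0]]
print(irregular_snapshots([normal, normal, normal, odd], threshold=15.0))
```

Only the short singular-value vector of each snapshot needs to be stored or transmitted. Note that distinct matrices can share the same singular values, so this comparison is a cheap filter rather than an exact test.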

3.1.2 Detection of Irregular Sensor Data 

Detection of irregularities is tightly interrelated to modeling of sensor data. Therefore, 
we propose to detect irregular single-attribute sensor data with respect to time or 
space by building models. 







For temporal irregularities in sensor data, we build a model of the sensory data as 
the readings of a node come in. When some reading substantially affects the coeffi- 
cients of the model, it is identified as an irregularity. With resource constraints of 
sensor nodes, we may need to approximate the distribution of data instead of main- 
taining all historical data. In many applications, it suffices to consider the most recent 
N values in a sliding time window. 
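
A minimal sketch of temporal irregularity detection over the most recent N readings is given below. As a simple stand-in for "a reading that substantially affects the model coefficients", it models the window by its mean and standard deviation and flags readings more than k standard deviations from the mean; the class name, the default parameters, and this choice of model are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class SlidingWindowDetector:
    """Flags readings that deviate strongly from the last `window` values."""

    def __init__(self, window=20, k=3.0):
        self.values = deque(maxlen=window)  # only the most recent N readings
        self.k = k

    def update(self, reading):
        """Return True if `reading` is irregular relative to the window."""
        irregular = False
        if len(self.values) >= 2:
            mu, sigma = mean(self.values), pstdev(self.values)
            irregular = abs(reading - mu) > self.k * max(sigma, 1e-9)
        if not irregular:
            self.values.append(reading)  # irregular readings do not update the model
        return irregular


detector = SlidingWindowDetector(window=5, k=3.0)
stream = [20.0, 20.5, 19.8, 20.2, 20.1, 35.0, 20.3]
flags = [i for i, r in enumerate(stream) if detector.update(r)]
print(flags)  # [5]
```

The node keeps only N values, so memory use is constant regardless of how long the stream runs.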

For spatial irregularities in sensor data, we build a statistical model of the readings of 
neighboring nodes. If some readings of a node differ from what the model anticipates 
based on the readings of the neighboring nodes, an irregularity is detected. In order 
to reduce resource consumption, we may define the neighboring nodes to be those 
only a single hop away on the network. As a node moves geographically, the parameters 
of its model are incrementally adjusted. Distributed modeling is also possible. 
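
Spatial detection can be illustrated in the same spirit. In the sketch below the "statistical model" of the neighbourhood is simply the median of the one-hop neighbours' readings, which is an assumption made for brevity; the function name, node identifiers, and tolerance are hypothetical.

```python
from statistics import median


def spatial_irregularities(readings, neighbours, tolerance):
    """Return nodes whose reading differs from the neighbourhood median.

    readings:   dict node -> latest reading.
    neighbours: dict node -> list of one-hop neighbour nodes.
    """
    flagged = []
    for node, value in readings.items():
        nearby = [readings[n] for n in neighbours.get(node, ()) if n in readings]
        if nearby and abs(value - median(nearby)) > tolerance:
            flagged.append(node)
    return flagged


readings = {"a": 20.1, "b": 20.4, "c": 31.0, "d": 19.9}
neighbours = {
    "a": ["b", "c", "d"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["a", "b", "c"],
}
print(spatial_irregularities(readings, neighbours, tolerance=5.0))  # ['c']
```

The median is used rather than the mean so that one faulty neighbour does not cause its healthy neighbours to be flagged as well.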

Finally, there is a tradeoff between model accuracy and resource consumption. 
On one hand, modeling can reduce resource consumption because only model pa- 
rameters are stored and transmitted instead of a large amount of sensor data. On the 
other hand, highly accurate models may result in a large size and frequent updates of 
data, which increases resource consumption. 



3.2 Clustering of Sensor Data 

We propose a new approach named multi-dimensional clustering of sensor data. This 
approach works as follows. First, cluster the sensor data along each sensor attribute 
separately. All the resulting clusters form a set of clusters, which we call the Cluster Set. 
Second, construct a bipartite graph G with the Sensor Set (the set of sensor nodes) and 
the Cluster Set being the two vertex sets. If some sensory attribute value of a sensor v 
belongs to cluster u, there is an edge pointing from v to u. Last, find all of the 
maximal complete bipartite sub-graphs, i.e., the maximal bipartite cliques of G. These 
cliques identify “which sensor nodes have similar sensory readings on which 
attributes”. 
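
The last step, finding the maximal bipartite cliques, can be sketched for small inputs as follows. The sketch assumes the per-attribute clustering has already been done, so each sensor comes with the set of cluster labels it belongs to; it enumerates every cluster set obtainable by intersecting sensors' label sets, which is simple but can be exponential in the worst case. The function name and labels are illustrative.

```python
def maximal_bicliques(memberships):
    """Enumerate maximal bicliques of the sensor-cluster bipartite graph.

    memberships: dict sensor -> iterable of cluster labels (its neighbours).
    Returns a list of (sensors, clusters) pairs of frozensets.
    """
    neighbour_sets = {s: frozenset(c) for s, c in memberships.items()}
    closed = set()                            # cluster sets already expanded
    frontier = set(neighbour_sets.values())
    while frontier:
        clusters = frontier.pop()
        if not clusters or clusters in closed:
            continue
        closed.add(clusters)
        for other in neighbour_sets.values():
            frontier.add(clusters & other)    # intersections stay closed
    return [(frozenset(s for s, c in neighbour_sets.items() if clusters <= c),
             clusters)
            for clusters in closed]


memberships = {
    "s1": {"T1", "L1"},
    "s2": {"T1", "L1"},
    "s3": {"T1", "L2"},
}
for sensors, clusters in maximal_bicliques(memberships):
    print(sorted(sensors), sorted(clusters))
```

In this example sensors s1 and s2 agree on both temperature and light, while all three sensors agree on temperature only, so three maximal cliques are reported.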

Figure 1 illustrates an example of multi-dimensional clustering. In the bipartite 
graph, the set of vertices at the top consists of clusters of sensor data by a single 
attribute, e.g., T1, T2, and T3 are clusters by the temperature attribute, L1, L2, L3 by 
light, and H1, H2, H3 by humidity. The set of vertices at the bottom 
of the figure consists of the sensor nodes, with the dark ones being representative nodes 
of a clique. There are three cliques of sensor nodes in the figure. 

The cliques resulting from multi-dimensional clustering are useful not only for data 
analysis applications but also for network management and query optimization. Since 
all sensor nodes in one clique are similar, we can select a representative node for each 
clique to work on behalf of its clique, saving power at the cost of reduced 
data accuracy. This selection can be based on the residual energy of a node 
or the distance between the node and the base station. We can also select 
representative nodes based on a cost function of assigned tasks. In addition, the role of a 
representative node can be rotated among the nodes in one clique for load balancing. 

Fig. 1. Multi-dimension clustering 



3.3 Discovery of Sensory Attributes Correlations 

Sensory attributes are rarely independent and correlations are common. For example, 
empirical evidence has shown that temperature and humidity are closely correlated in 
some natural environments. Therefore, efficiently identifying correlations among 
multiple sensory attributes is important for data analysis applications. For instance, we 
can estimate the changes of some attributes from the changes of the correlated 
attributes. 

We treat readings of each sensory attribute as a data stream, i.e., a sequence of data 
items $v_i$ at sequence number i. As we are interested in correlation between data 
changes of various sensory attributes, e.g., correlation between the change of 
temperature values and that of light values, we replace each sensory data value $v_i$ with its 
difference from its previous data item, $\Delta_i = v_i - v_{i-1}$. Thus, a time series of sensor data 
is represented as a sequence of $\Delta_i$. 

Let $S_1, \ldots, S_{m-1}, S_m$ be a collection of $m$ sensor data streams, each for one
attribute. One way of representing these data streams is to use a matrix $A$ with time
points and attributes as row and column indexes. We can then group data by correlated
attributes or correlated time points [4][10] in this matrix. Recall that we propose to
use the SVD (Singular Value Decomposition) technique for matrix reduction in pattern
variation discovery. Here, this technique can also be used to find the best subspace
that identifies the strongest linear correlations in the underlying data set [10].
Additionally, SVD tries to identify similarity patterns (rectangular regions) of related
values in the $A$ matrix, and the similarity of each row with the patterns [4]. It will
naturally group similar “attribute-name” entries into attribute groups with similar
behavior. In pattern variation discovery, we use SVD to speed up the comparison among
multiple matrices. Here, in correlation discovery, we use SVD to consider the
correlation among the rows within one matrix.
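
As a rough illustration of the idea, the sketch below differences three synthetic streams, stacks them as the columns of a matrix (rows are time points, columns are attributes), and inspects the leading singular direction. The synthetic data, the standardization step, and all variable names are assumptions made for this example; it is not the authors' procedure.

```python
import numpy as np


def difference(stream):
    """Replace each value v_i by delta_i = v_i - v_{i-1}."""
    v = np.asarray(stream, dtype=float)
    return v[1:] - v[:-1]


# Synthetic streams (assumed data): humidity moves opposite to temperature,
# light is unrelated noise.
t = np.arange(50)
temperature = 22.0 + np.sin(t / 5.0)
humidity = 90.0 - 3.0 * np.sin(t / 5.0)
light = np.random.default_rng(0).normal(500.0, 1.0, size=t.size)

# Rows are time points, columns are attributes.
A = np.column_stack([difference(s) for s in (temperature, humidity, light)])

# Standardize each column so that attributes with large units do not dominate.
A_std = (A - A.mean(axis=0)) / A.std(axis=0)

U, s, Vt = np.linalg.svd(A_std, full_matrices=False)
explained = s**2 / np.sum(s**2)  # share of variance along each singular direction
first_direction = Vt[0]          # attribute weights of the strongest linear pattern
```

In this example the first direction weights temperature and humidity with opposite signs, which reflects their inverse relationship in the synthetic data.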

Alternatively, we can consider sensory attribute correlations as inter-transaction
association rules [13]. For example, a rule or correlation says that “if node A’s
attribute x goes up at time point 1, B’s attribute y will (with an 80% probability) go
up at time point 2 and C’s attribute x will go down at time point 3”. However, in the
context of a large-scale mobile sensor network, the problem is more complex and
challenging than traditional market basket analysis.
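
To make the notion of such a rule concrete, the following sketch estimates the confidence of one inter-transaction rule from a toy event timeline. The event encoding as `(node, attribute, direction)` tuples, the function name, and the data are hypothetical; a real miner would also have to search over candidate rules, which is not shown.

```python
def rule_confidence(timeline, antecedent, consequents):
    """Estimate how often the consequents follow the antecedent.

    timeline: list of sets of events, indexed by time point.
    consequents: list of (offset, event) pairs relative to the antecedent.
    Returns None when the antecedent never occurs.
    """
    horizon = max(offset for offset, _ in consequents)
    matched = satisfied = 0
    for t in range(len(timeline) - horizon):
        if antecedent in timeline[t]:
            matched += 1
            if all(event in timeline[t + offset] for offset, event in consequents):
                satisfied += 1
    return satisfied / matched if matched else None


A_UP = ("A", "x", "up")
B_UP = ("B", "y", "up")
C_DOWN = ("C", "x", "down")

# Toy timeline (assumed data): the rule holds after time 0 but not after time 3.
timeline = [{A_UP}, {B_UP}, {C_DOWN}, {A_UP}, {B_UP}, set()]
confidence = rule_confidence(timeline, A_UP, [(1, B_UP), (2, C_DOWN)])
```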



4 Related Work 

There has been much work in the areas of sensor networks, data mining, and data 
streams, but little work has been done at the intersection of these areas. 

Sensor networking protocols have attracted a tremendous amount of research effort 
[6]. Sensor databases and query processing techniques have been proposed for ac- 
quiring and managing sensor data [14][15]. However, existing sensor databases lack 
support for complex, online mining operations. 

There is extensive literature regarding outlier (irregularity) detection [12] [21]. 
However, none of these approaches is directly applicable to a sensor network envi- 
ronment. There is also initial work on modeling sensor data, including a distributed 
model based on kernel density estimators [16], and a distributed regression frame- 
work [9]. 

Although the clustering problem has been widely studied [5], we have not seen any
previous work on multi-dimensional clustering. Existing sensor network clustering
methods [3][8][22] are mainly concerned with the distance among nodes and the network
topology, not with sensory data. Recently, the clustering problem has also been studied
in data streams [1][18].

There is some work on correlated data items [11] with respect to their accesses in 
order to improve data accessibility in sensor networks. In comparison, we focus on 
finding out correlations among sensory values. A related problem is identifying cor- 
relations among streams [10]. There has also been initial work on online analytical 
processing and mining for data streams [1][17][18]. However, they seldom consider 
the unique challenge in sensor networks. 



5 Conclusions 

We have identified the challenges for online mining in large-scale, mobile sensor
network environments. The main concern is to satisfy the mining accuracy requirements
while keeping resource consumption to a minimum. We identify three research problems
to work on: (1) detection of sensor data irregularities; (2) sensor data clustering;
and (3) discovery of sensory attribute correlations. We provide preliminary
considerations towards solving these problems. We believe that the patterns discovered
can not only enable the applications to gain insight into the sensor data, but also
be used to tune the system performance. As future work, we will further consider the
energy-awareness, adaptivity, and fault-tolerance of online mining for sensor
networks, in addition to a further study of our proposed approaches.

Abstract. Many sensor network applications are data-centric, and data analysis
plays an important role in these applications. However, it is a challenging task 
to find out what specific problems and requirements sensory data analysis will 
face, because these applications are tightly embedded in the physical world and 
the sensory data reflect the physical phenomena being monitored. In this paper, 
we propose to use field studies as an alternative for identifying these problems 
and requirements. Specifically, we deployed an experimental sensor network 
for monitoring the frog pond in our university and analyzed the collected sen- 
sory data. We present our methodology of sensory data collection and analysis. 
We also discuss preliminary analytical results from the collected sensory data, 
together with our generalization for similar sensor network applications. We 
find that this case study helped us identify and understand several problems, ei- 
ther general or specific, in real-world sensor network application deployment 
and sensory data analysis. 



1 Introduction 

Sensor network applications pose a number of novel problems for networking
([4][9][10]) and data management ([7][15]). Nevertheless, many more problems and
requirements in real-world sensor network applications are to be identified and
understood, especially for sensory data analysis. Due to the tight integration of these
applications with the physical world, field studies are effective, sometimes necessary,
for identifying problems and requirements. In this paper, we present a case study of
sensory data analysis for a small-scale real-world sensor network application. Our
goal is to identify and understand problems and requirements specifically for sensory
data analysis.

From our case study, we observe that most of the problems in our sensory data
analysis arose because the sensor network application was deeply embedded in the
physical environment and the sensory data reflected the physical phenomena under
study. For instance, we find that even though there were inherent trends in the
readings of individual sensors as well as strong correlations between readings of
multiple sensors, outliers were common and the causes of some outliers were hard to
determine.

The remainder of the paper is organized as follows. Section 2 introduces the
deployment of the case study. Section 3 presents our preliminary analytical results of
the collected sensory data and discusses the generalization of our experience. Section
4 compares related work and Section 5 concludes the paper.



2 Deployment of the Case Study 

We started the case study around a frog pond on the HKUST campus in April 2004.
The frog pond is located at the northeastern corner of the campus and is surrounded
by two pagodas and various plants. Throughout the late spring, the frogs in the pond
croak loudly all day long.

The smart sensor nodes we used in the case study were Crossbow MICA2 motes [3]. Each
mote consists of an Atmel ATmega128L low-power micro-controller running TinyOS [13]
with a 900 MHz radio channel. Mote 0 connects to a PC-grade base station through a PC
interface card, and each of the other motes carries a MICA2-compatible sensor board.
The scale of the case study was small due to our resource limits.

We deployed a total of nine MICA2 motes in two groups, with each group in a pa- 
goda around the frog pond. Fig. 1 shows the deployment of the two groups. The 
base station (Mote 0) of each group was connected to the serial port of a notebook 
through a MIB510CA interface board and a serial cable. 






Fig. 1. Deployment of two groups of Motes 



Group 1 was deployed in the pagoda that is surrounded by the frog pond. Each of
Motes 1-5 was fitted with an MTS310CA sensor board, which includes a temperature
sensor, a light sensor, a microphone, a 2-axis accelerometer, and a 2-axis
magnetometer. We installed TinyDB [12] on the motes and used the TinyDB GUI to collect
sensor readings and to log the readings to a text file.

Group 2 was deployed in the pagoda that is near the frog pond and overlooks the
sea. Its Motes 1-2 used MTS420CA weather sensor boards. This type of sensor board
consists of a humidity and temperature sensor, a barometric pressure and temperature
sensor, an ambient light sensor, a 2-axis accelerometer, and a GPS module. We
configured the Xlisten program and the corresponding on-mote module XSensorMTS400,
downloaded from the TinyOS SourceForge CVS [14], for logging the sensory data from
this type of sensor board.

It was cloudy with intermittent rain on the day of our data collection. We collected
one day of data in four two-hour periods: 6:30 - 8:30 (morning), 12:30 - 14:30 (noon),
17:30 - 19:30 (dusk), and 22:00 - 24:00 (night). We set the sample period to 30
seconds. At the end of the eight-hour data collection, we had logged thousands of
sensor readings per group.
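
The order of magnitude can be checked with a short calculation. The sketch below assumes that every sensing mote produced one sample per 30-second period with no losses, so the results are upper bounds rather than the actual logged counts; each sample may also carry several attribute readings.

```python
PERIODS = [("06:30", "08:30"), ("12:30", "14:30"), ("17:30", "19:30"), ("22:00", "24:00")]
SAMPLE_PERIOD_S = 30


def minutes(hhmm):
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


total_seconds = sum((minutes(end) - minutes(start)) * 60 for start, end in PERIODS)
samples_per_mote = total_seconds // SAMPLE_PERIOD_S

# Sensing motes per group as described in the deployment (Mote 0 is the base station).
sensing_motes = {"Group 1": 5, "Group 2": 2}
upper_bound = {group: n * samples_per_mote for group, n in sensing_motes.items()}
```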

We used homegrown programs to pre-process the collected sensory data before
conducting further analysis. First, we converted the sensor readings from raw ADC
counts generated by the sensor boards to more human-friendly engineering units (e.g.,
degrees Celsius for temperature) using conversion formulas provided by Crossbow.
Next, we parsed and imported the sensor readings (both ADC counts and engineering
units) into a Microsoft Access database. Finally, we used SQL queries and Microsoft
Excel charts to perform preliminary data analysis.



3 Sensory Data Analysis 

In this section, we analyze the sensory data we collected in the experiment. From a
database perspective, the focus of our analysis is on identifying and understanding the
problems and requirements that are specific to sensory data. Admittedly, the
analytical results we present here are preliminary. However, the methodology and
insights gained from these initial analytical results are valuable for more advanced
analysis and would be unavailable or less convincing without the case study.



3.1 Trends in Readings of Individual Sensors

We first give an analysis of individual sensor readings using some typical examples.

Light. Fig. 2 shows the light readings of Mote 1 and Mote 5 in Group 1. We pick
these two motes because in our deployment they were the nearest (Mote 1) and farthest
(Mote 5) motes from the Group 1 base station. Each point in the figure corresponds to
one light reading. In the morning, the light readings kept increasing due to the
sunrise. At noon, the light readings were at their highest for the day and decreased
slightly past noon. At dusk, the light readings decreased sharply due to the sunset
and then jumped up to a certain level because the lamps around the pagoda were lit.
The light readings at night remained almost constant under the lamplight.

The two motes in Fig. 2 had similar readings. The other three motes of Group 1
also had readings similar to these two. This similarity arose because the area of a
pagoda is small and thus the motes in this group were located near one another. The
proximity of motes also made readings of other sensors (e.g., humidity, temperature)
within a group similar.

However, for Group 2, the readings of the ambient light sensors remained at a
constant value of 131.448624 Lux due to a bug in the XSensorMTS400 program that we
used. As a result, we did not compare light readings between the two groups. We are
developing our own data logging programs for future use.



Fig. 2. Light readings of Group 1 (vertical axis: Light, ADC mv)

Fig. 3. Temperature readings of two groups



Temperature. Fig. 3 shows the temperature readings of Group 1 Mote 1 and Group 
2 Mote 2. We put the readings from different groups in one figure for comparison 
purpose. Again, temperature readings from motes of the same group were similar due 
to their proximity. 

The temperature readings of Group 1 motes varied from 21 to 24°C, whereas those 
of Group 2 motes varied from 21 to 23°C. The temperature measured by Group 2 
was often slightly higher than that measured by Group 1 (except around noontime), 
even though the two pagodas were within a distance of 20 meters from each other. 
We think there are two possible reasons: (1) the temperature sensors of the two 
groups are made by different companies and therefore differ in hardware characteris- 
tics, (2) the microclimates in the two pagodas differ due to their different geographi- 
cal locations. 

Humidity. Humidity sensors were available only in Group 2. The humidity readings 
of the two motes in Group 2 are illustrated in Fig. 4. Most of the time, the readings 
remained at the level of around 90%. 

Note that there were some abnormally high humidity readings (larger than 130%) from
Mote 1 at the beginning of the morning period. These abnormal readings occurred
because some raindrops accidentally splashed onto the mote. The water made the
humidity sensor malfunction and return abnormally high readings. This kind of physical
failure is not uncommon and is recoverable [11]. After being dried, the sensor
returned to normal operation.

Noise. Microphone sensors were available only in Group 1. Fig. 5 shows the noise
readings of Mote 1 and Mote 5 in Group 1. Unlike temperature, light, and humidity
readings, which are more continuous, the noise readings are more discrete. The
scattered data points in the noise readings in Fig. 5 usually indicate actual, sudden
changes (events) in the sound level in the environment. In comparison, the outlier
points in the temperature, light, and humidity readings in the previous figures were
often due to errors.

From Fig. 5 we see that the frogs croaked most actively in the early morning and
were quietest around noontime. Also, some of the high data points in the figure
were caused by passers-by talking.

Correlation. Our analysis on correlation of sensor readings is limited and will be an 
important part of our future work. So far we have found that the temperature and 
humidity readings were inversely correlated and that the temperature and light read- 
ings were not correlated as found in other environments [5]. 
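
One simple way to quantify such relationships is the Pearson correlation coefficient. The sketch below uses made-up temperature and humidity values, not the measurements from this study, and the function name is an assumption. It also rejects constant series, since the coefficient is undefined for a stuck sensor such as the Group 2 light reading.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    if min(xs) == max(xs) or min(ys) == max(ys):
        raise ValueError("correlation is undefined for a constant series")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only (assumed data).
temperature = [21.0, 21.5, 22.0, 23.0, 23.5, 22.5]
humidity = [95.0, 93.5, 90.5, 87.0, 85.5, 89.0]
r = pearson(temperature, humidity)  # strongly negative for this toy data
```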

Fig. 4. Humidity readings of Group 2



Fig. 5. Noise readings of Group 1 



3.2 Discussion 

Having presented our initial results with a specific sensor network application, we
discuss how to generalize our findings to similar sensor network applications.

We start by summarizing the problems that we encountered in the application
deployment and data analysis. First, both hardware problems and software bugs are
common in sensor network applications. The reasons include sensor networks being
an emerging technology and the typical application environment being the physical
world, which is full of unpredictable events and changes. Second, data pre-processing
and post-processing constitute a large amount of the work needed to facilitate sensory
data analysis. This work mainly includes data cleaning and format conversion prior to
analysis, and visualization after analysis. Third, sensory data exhibit regularities as
well as abnormalities, and the causes of outliers are hard to determine.

Based on this summary of experienced problems, we propose the following three 
requirements for a sensory data analyzer. 

(1) The analyzer should have data acquisition functions that are fault-tolerant and
adaptive, since the sensory data collection process determines the quality of sensory
data. The fault-tolerance requirement arises because hardware malfunctions are common
in field studies, as we have already experienced. It is thus desirable that a data
collector is able to recover, to migrate the work from a failed node to a normal node,
and to resume the work. The adaptivity requirement is to take advantage of the
patterns and regularities captured in sensor readings. For instance, continuous
quantities such as temperature can be measured with a sampling frequency adapted to
the changes in the temperature readings in order to improve power efficiency while
keeping the quality of sensory data unaffected.

(2) The analyzer should have a set of basic functions for data pre-processing and
post-processing operations. Data pre-processing is to further ensure the quality of
data for analysis. Data post-processing is mainly for the presentation of analytical
results. For example, the function convert() converts sensor readings from raw ADC
counts to human-friendly engineering units, the function calibrate() performs
hardware-specific calibration of the readings, and the function plot() plots data
points and curves together with analytical summaries following user-defined criteria.

(3) As the core of the analyzer, the sensory data analysis functions include pattern
and outlier detection, and correlation of multiple sensory attributes or multiple sensor
nodes. We further discuss these two kinds of functions as follows.
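
Before turning to the analysis functions, requirements (1) and (2) can be illustrated with a small sketch. The names convert() and calibrate() follow the text, but their signatures, the linear conversion form, and all constants are placeholders chosen for this example; they are not Crossbow's actual conversion formulas. The adaptive sampling rule (halve the period after a large change, double it otherwise) is likewise only one possible heuristic.

```python
ADC_MAX = 1023  # assumes a 10-bit ADC


def convert(adc_count, *, scale, offset):
    """Map a raw ADC count to an engineering unit with a placeholder linear formula."""
    if not 0 <= adc_count <= ADC_MAX:
        raise ValueError(f"ADC count out of range: {adc_count}")
    return scale * adc_count + offset


def calibrate(value, *, gain=1.0, bias=0.0):
    """Apply a per-sensor correction (hypothetical gain and bias)."""
    return gain * value + bias


def next_sample_period(previous, current, period_s, *,
                       threshold=0.2, min_period_s=30, max_period_s=480):
    """Halve the period after a large change, otherwise double it, within bounds."""
    if abs(current - previous) > threshold:
        period_s //= 2
    else:
        period_s *= 2
    return max(min_period_s, min(max_period_s, period_s))


# Hypothetical constants: 0.1 degrees per count, -20 degree offset, -0.5 degree bias.
celsius = calibrate(convert(512, scale=0.1, offset=-20.0), bias=-0.5)
slow_down = next_sample_period(22.0, 22.05, 60)  # stable readings, sample less often
speed_up = next_sample_period(22.0, 23.0, 60)    # rapid change, sample more often
```

Whether such a heuristic really leaves data quality unaffected depends on the threshold and bounds, which would have to be tuned per attribute.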

First, detecting patterns and outliers in single-node single-attribute sensory data is
the basic analytical operation. For instance, given the temperature readings of one
sensor node, the basic analytical information about these readings must include a
summary of the range, the trend, and the outliers of the data. As a result of measuring
natural phenomena, sensory data have inherent patterns as well as outliers. Moreover,
outliers are sometimes due to real events in the environment and sometimes due to
system errors. It is necessary to pay special attention to outlier analysis.
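
A minimal sketch of such a summary is given below. It flags outliers with a median-based modified z-score and fits a least-squares trend to the remaining points; both choices, the threshold of 3.5, and the sample values are assumptions for illustration rather than a method prescribed by the paper.

```python
import statistics


def summarize(readings, *, threshold=3.5):
    """Return the range, a linear trend, and the outliers of one reading series."""
    if len(readings) < 2:
        raise ValueError("need at least two readings")
    med = statistics.median(readings)
    mad = statistics.median([abs(r - med) for r in readings])

    def is_outlier(r):
        if mad == 0:
            return r != med  # degenerate case: flag anything off the median
        return 0.6745 * abs(r - med) / mad > threshold

    outliers = [(i, r) for i, r in enumerate(readings) if is_outlier(r)]
    inliers = [(i, r) for i, r in enumerate(readings) if not is_outlier(r)]
    values = [r for _, r in inliers]

    slope = 0.0
    if len(inliers) >= 2:
        mean_i = sum(i for i, _ in inliers) / len(inliers)
        mean_r = sum(values) / len(values)
        num = sum((i - mean_i) * (r - mean_r) for i, r in inliers)
        den = sum((i - mean_i) ** 2 for i, _ in inliers)
        slope = num / den  # change per sample over the non-outlier points

    return {
        "range": (min(values), max(values)),
        "trend_per_sample": slope,
        "outliers": outliers,
    }


# Illustrative humidity series (assumed data) with one splash-like spike.
humidity = [90.0, 91.0, 89.5, 135.0, 90.5, 90.0, 91.5, 90.0]
summary = summarize(humidity)
```

Note that the sketch only separates outliers from the bulk of the data; deciding whether an outlier is a real event or a system error needs additional context.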

Second, correlation analysis gives more insight into sensory data, especially
because each sensor node has multiple sensory attributes and multiple sensor nodes
work concurrently in a geographical region. The inherent correlations between natural
phenomena as well as the temporal and spatial correlations of sensor nodes will be
useful for both sensor network applications and system deployment. For example,
when an application is detecting transient changes such as a sudden increase in the
noise level, it can utilize the spatial correlation of a cluster of adjacent nodes to
detect the noise with high fidelity. In other words, if one sensor node detects a
sudden increase in the noise level, it might be either a real event or a system error.
But if multiple nearby nodes report the same event, the probability of a system error
is much lower, and that of a real event much higher, than when the event is reported
by a single node.
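
The argument can be made concrete under a simplifying assumption: if each node raises a spurious report independently with probability p, then k nodes all doing so has probability p^k. The sketch below computes this and checks whether reports from distinct nodes fall within a common time window. The independence assumption, the window length, and the function names are illustrative only.

```python
def joint_false_alarm_probability(p_single, k):
    """Probability that k nodes all report spuriously, assuming independent errors."""
    return p_single ** k


def corroborated(reports, *, window_s=30.0, min_nodes=2):
    """Check whether at least min_nodes distinct nodes report within window_s seconds.

    reports: list of (node_id, timestamp_s) pairs.
    """
    reports = sorted(reports, key=lambda r: r[1])
    for i, (_, start) in enumerate(reports):
        nodes = {node for node, ts in reports[i:] if ts - start <= window_s}
        if len(nodes) >= min_nodes:
            return True
    return False


p_three_nodes = joint_false_alarm_probability(0.05, 3)
same_event = corroborated([(1, 100.0), (5, 110.0)])
single_node = corroborated([(1, 100.0), (1, 105.0)])
```

In practice errors of nearby nodes may be correlated (for example, rain wetting several motes at once), so the product is an optimistic bound.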

In summary, we find several problems in sensory data analysis, ranging from
hardware or software problems in the deployed applications to difficulties in
producing meaningful analytical results out of sensory data. Correspondingly, we
propose several requirements for sensory data analysis systems, including
fault-tolerance and adaptivity of data collection, a set of data pre-processing and
post-processing functions, and basic data analysis functions such as pattern, outlier,
and correlation detection. Our ultimate goal is to build a general sensory data
analysis system for various data-centric sensor network monitoring applications.



4 Related Work 

A number of sensor network projects have real-world deployments, including ALERT
[1], GDI ([8][11]), PODS [2], Surveillance, and NIMS [6].

ALERT (Automated Local Evaluation in Real Time) is a well-known, practical
sensor network application [1]. It provides real-time rainfall and water level
information for flood forecasting. ALERT mainly focuses on special-purpose sensory
data statistics and uses them for prediction.

Both GDI ([8][11]) and PODS [2] deployed sensor networks in outdoor environments
mainly for the purpose of system performance study. Specifically, GDI deployed a
multi-tier sensor network for habitat monitoring, whereas PODS was deployed in
Hawaii Volcanoes National Park.

Surveillance and NIMS [6] are two demonstrations. Surveillance built an energy- 
efficient surveillance system using a wireless sensor network, and NIMS focused on 
new, mobile sensing devices on a suspended infrastructure. 

In comparison, our case study is at a smaller scale and a finer level, with a focus 
on identifying general problems and requirements for advanced sensory data analysis 
in real-world applications. 



5 Conclusions 

In this paper, we describe our case study of deploying a small-scale monitoring
application at the frog pond in our university and analyzing the collected sensory
data. Our goal is to identify the problems and requirements for sensory data analysis
in real-world sensor network applications. We find that (1) data collection and logging
functions need to be failure-aware and easy to resume, (2) data pre-processing such as
format conversion and post-processing such as visualization are necessary for sensory
data analysis, and (3) essential sensory data analysis functions include pattern
and outlier detection for readings of individual sensors and correlation detection for
readings of multiple sensors.

Our future work includes designing and implementing advanced sensory data 
analysis tools and conducting larger-scale studies using these tools. 

Abstract. A joint effort between the Chinese Academy of Sciences and the
Hong Kong University of Science and Technology, the BLOSSOMS sensor
network project aims to identify research issues at all levels from practical
applications down to the design of sensor nodes. In this project, a heterogeneous
sensor array including different types of application-dependent sensors as well
as monitoring sensors and intruding sensors is being developed. Application-dependent
power-aware communication protocols are also being studied for
communications among sensor nodes. An ontology-based middleware is built
to relieve application developers of the burden of collecting, classifying and
processing messy sensing contexts. This project will also develop a set of tools
allowing researchers to model, simulate/emulate, analyze, and monitor various
functions of sensor networks.



1 Introduction 

Recent advances in wireless communication and hardware have enabled the dense
deployment of distributed sensor networks and opened a wide range of application
domains, such as environmental monitoring of changes in light, temperature, pressure,
and acoustics for endangered species or ecosystems; security surveillance to prevent
attacks with chemical, biological, or radiological weapons; target tracking
of moving objects; and battlefield awareness [1]. In all these applications,
information collected by sensor nodes may be sent to the gateway (sink) from time to
time. By exploiting the sensor network’s spatial coverage and multiplicity of sensing
modalities, the network can achieve a good global measurement. Because of its
potential impact on so many application domains, research in sensor networks has
received a great deal of attention worldwide, the most notable effort being
Berkeley’s MOTE [2].

In March 2004, recognizing the importance of sensor networks, the Chinese Academy
of Sciences and the Hong Kong University of Science and Technology jointly launched
an effort to investigate both fundamental and practical research issues in sensor
networks. The goal of this effort, the BLOSSOMS project, is to build lightweight
optimized sensor systems on a massive scale. The objective of this research project is
to identify research issues at all levels from practical applications down to the
design of sensor nodes. As a multidisciplinary effort, this project involves
researchers from different disciplines including hardware design, embedded
systems, wired and wireless communications, software engineering, distributed query
processing, machine learning, and performance evaluation. In a short period of time,
the project has made very good progress. This paper gives an overview of the
BLOSSOMS project and introduces some research directions being investigated.



2 Project Overview 

Figure 1 shows the system architecture of the BLOSSOMS project. There are four
layers. The bottom two layers are implemented in individual sensor nodes, while the
top two layers are executed at gateway nodes or some application nodes. Typically, a
gateway node will broadcast a command to all active sensor nodes and selected active
sensor nodes will report the result back to the gateway node. The communication
among the sensor nodes and gateway nodes is wireless and usually short-range, using
protocols such as ZigBee, to reduce power consumption. The communication among
gateway nodes and application nodes can be either wired or wireless and typically
adopts standard protocols, such as 802.11x and TCP/UDP/IP. In addition, the MEADOWS
module (on the left side of the figure) provides a set of tools for sensor network
studies. The following sub-sections describe each BLOSSOMS component in more detail.

Figure 1: The BLOSSOMS system architecture.

2.1 BUDS: Heterogeneous Sensor Nodes

A sensor node is an embedded system consisting of four basic components, namely a
sensing unit, a processing unit, a communication unit, and a power unit. Additionally,
there may be application-dependent, optional units such as a mobilizer, a location
finding system, and a power generator [1]. Our research on sensor node architecture
is currently being undertaken at ICT/CAS. The goal is to build multifunctional sensing
nodes, called BUDS, targeting applications such as intelligent transportation systems,
precision agriculture, remote medical care, and public safety monitoring systems.
Efforts are being made to tackle the following technical issues:

• the development of an embedded operating system for ultra low power sensor
nodes;

• the design of general sensor nodes and system integration method;

• the design of special-purpose sensor nodes, such as video sensors, monitoring
sensors, and intruding sensors; and

• the low power management and high reliability design at the system level.

While the general sensors require ultra low power consumption, the special sensors
may have different requirements, such as a powerful CPU.

The monitoring sensor is a special sensor needed to help debug and monitor the
behavior of other sensors. The monitoring sensor will operate on two frequencies.
One is the same as that of the other sensors and is used only in listening mode to
observe their behavior. The other frequency is used to report the collected
information to MEADOWS for further analysis. Power consumption is not a
critical issue for such sensors. The intruding sensor is used to simulate malicious
sensors trying to attack a working sensor array. We will use the standard low power
ZigBee as the wireless communication mechanism.

2.2 Communication Protocols 

The communication protocol plays a major role in allowing commands to be propagated
from gateway nodes to active sensor nodes and in collecting replies from some sensor
nodes to the gateway nodes. The communication pattern is application dependent and
many protocols for different applications have been proposed (e.g., [3]). Furthermore,
sensor nodes are unlikely to have a globally unique ID, which would be too expensive.
This creates the new challenge of developing content-based routing among sensor nodes.

If the location of each sensor node can be identified, more useful information can
then be collected. Determining the location of sensor nodes is another challenge to be
addressed in this research. We are investigating both anchor-free [4] and
anchor-based location sensing techniques [5].

2.3 CABOT : Ontology-Based Middleware 

Applications of sensor networks often need to be aware of their contexts, which could
be related to sensors, time, location, and security. These applications typically have
to carry out tedious tasks of gathering, classifying and processing messy context
information due to a lack of middleware support. The middleware support for these
applications is a major topic in pervasive computing. CABOT is a user-centric
middleware being developed to provide infrastructural support for sensor network
applications. The main responsibilities of CABOT are to provide (a) the resource
management of sensor networks, (b) the analysis of dynamic environments, and (c) the
development support of context-aware pervasive computing. Current features of CABOT
include:

• The support of extensible context gathering (decoupling applications and sensor 
networks), application-customized context classifying (subscribing interested 
context information in application-specific ways) and intelligent context process- 
ing (reasoning on application requirements). 

• The provision of development support for context-aware pervasive applications. 
This relieves application developers from the tedious programming of gathering, 
classifying and processing messy context information in sensor networks. 

• The provision of naming services to facilitate effective collaboration amongst 
applications and sensor networks. 

CABOT also takes privacy into account. The privacy services in CABOT are able to
modify or even hide certain kinds of sensor contexts based on user identities and
relevant privacy policies. For further details about CABOT, please refer to another
paper in this workshop [6].

2.4 Application and User Interface 

On top of CABOT, many applications may be developed. One application being 
developed is a location-estimation system. Such systems can be used to support many 
location based services, such as location based content delivery and object and people 
tracking. It is still a challenging problem how to predict accurately the location of a 
wireless device in an indoor environment, especially when one holds a small device 
with limited computational and power resources. On receiving signals from various 
access points distributed in an indoor wireless environment, it is natural to ask: where 
is the user now in the building? And, how may we best help the user [7]? 

The above questions present a challenge for computer science in general and artifi- 
cial intelligence in particular. We have developed an integrated framework called 
LEAPS (location estimation and action prediction system), which provides solutions 
for a number of location-related applications, including location queries and location- 
based user behavior recognition [8]. Such a system can be useful for a variety of 
applications. In education, the system can provide a student walking in the faculty 
office area after a class with information about professors, their office hours and class 
information. The system can offer help by giving directions to the right office. For 
people suffering from various cognitive limitations in hospitals and care facilities, the 
technique can discover when a person’s behavior is out of the norm and provide help 
in a timely manner. For students in a university campus, the plan recognizer can 
enable intelligent pre-fetching of class-content-related information. For shoppers in a 
busy business environment such as a shopping mall, services and products can be 
offered not only according to people’s current location, but also according to their 
intended goals and actions. Plan recognition using a relational model also allows 
prediction of locations that a user might visit, even if those locations are new to 
the user. 

Some applications require integrated data access, communication, and actions on 
heterogeneous devices. For instance, a lab monitoring application may need to acquire 
sensor readings from the environment and to operate network cameras to capture the 
scene upon the detection of a suspicious event. For these applications, a declarative 
SQL (Structured Query Language) interface is useful for specifying the data of interest 
and the intended actions. The interface provides a relational view of the data flowing 
from sensors as well as an abstraction of device-oriented operations (actions). System 
support and optimization mechanisms are implemented behind the interface for ease of 
application development and performance [9]. 
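
As a rough illustration of such a declarative interface, the sketch below loads sensor readings into an in-memory relational table and triggers a device action for every row selected by a query. It is not the interface of the project itself: the table layout, the threshold and the `capture_photo` action are assumptions made only for this example.

```python
import sqlite3


def capture_photo(camera_id):
    """Hypothetical device action; a real system would drive a network camera."""
    return f"photo from {camera_id}"


def run_monitoring_query(readings, threshold):
    """Expose readings as a relational view and act on suspicious rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sensors (node_id TEXT, camera_id TEXT, noise REAL)")
    db.executemany("INSERT INTO sensors VALUES (?, ?, ?)", readings)
    # The declarative part: state which data is of interest, not how to fetch it.
    rows = db.execute(
        "SELECT node_id, camera_id FROM sensors WHERE noise > ? ORDER BY node_id",
        (threshold,),
    ).fetchall()
    db.close()
    # The action part: one device operation per selected row.
    return [(node_id, capture_photo(camera_id)) for node_id, camera_id in rows]


readings = [("n1", "cam1", 35.0), ("n2", "cam2", 82.5), ("n3", "cam1", 90.1)]
actions = run_monitoring_query(readings, threshold=80.0)
```

Here the query states what counts as a suspicious event, while the action is attached to the result rows; a real system would optimize both behind the interface.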

2.5 MEADOWS: A Suite of Evaluation Tools 

We are developing a collection of tools for large-scale, in-depth studies on sensor 
networks. Specifically, these tools support modeling, emulation, and data analysis for 
wireless sensor networks (MEADOWS) [11] in addition to monitoring. We have 
designed a hierarchical power consumption model for sensor databases, a distributed 
emulator called VMNet (Virtual Mote Network) of sensor networks [10], and a few 
data analysis functions for sensory data. These tools will interact with all layers of 
BLOSSOMS closely. On one hand, information about real systems (e.g., sensor node 
statistics from the BUDS layer) provides system parameters for MEADOWS; on the 
other hand, MEADOWS enables analysis and validation of real systems. 



3 Conclusions 

We have presented an overview of the BLOSSOMS project being investigated at 
CAS and HKUST. Building a working sensor (BUDS) is a major challenge to the 
project. The MEADOWS tools are also essential because they allow us to simulate 
the behavior of large-scale sensor applications. The CABOT middleware provides 
infrastructural support for applications and the LEAPS system allows location-based 
applications to be easily built. This project involves close collaboration of research- 
ers from different disciplines because of the broad project scope. At the time of writ- 
ing, the project is still in its beginning stage. As the project moves forward, we ex- 
pect to identify more research issues and produce more research results. 

Abstract. A set of sensor nodes is the basic component of a sensor network. 
Many researchers are currently engaged in developing pervasive sensor nodes 
because of the great promise and potential shown by applications of various 
wireless remote sensor networks. This short paper describes the concept of sensor 
node architecture and current research activities on sensor node development at 
ICTCAS. 



1 The Concept of Sensor Node Architecture 

A sensor network is made up of a set of sensor nodes, which are distributed in a 
sensor field, and a sink, which communicates via the Internet with the task manager 
that interfaces with users. A set of sensor nodes is the basic component 
of a sensor network. Many researchers are currently engaged in developing pervasive 
sensor nodes [1-3] because of the great promise and potential shown by applications 
of various wireless remote sensor networks [4-10]. A sensor node is composed of 
four basic components, as shown in Fig. 1: a sensing unit, a processing unit, 
a communication unit and a power unit. 

Fig. 1. The components of a sensor node 

Sensing units are usually made up of application-specific sensors and ADCs 
(analog-to-digital converters), which digitize the analog signals produced by the 
sensors when they sense a particular phenomenon. In some cases, an actuator is also 
needed. Sensors play a key role in a sensor network: they are the very front end 
connecting our physical world to the computational world and the Internet. Although 
MEMS technology has been making steady progress in the past decades, there is still 
considerable room for the further development of smart front-end sensors. Among them, 
chemical and biochemical sensors remain one of the most challenging sensor 
groups to be explored and developed, e.g. sensors to detect toxic or explosive traces in 
public areas, sensors for diagnostic analysis and sensors used under extreme 
conditions. New sensing principles, new sensing materials and new sensor designs need 
to be invented and adopted. 

The processing unit is usually associated with an embedded operating system, a 
microcontroller and a storage part. It manages data acquisition, analyzes the raw 
sensing data and formulates answers to specific user requests. It also controls 
communication and performs a wide variety of application-specific tasks. Energy and 
cost are two key constraints for processing components. Nodes may have different types 
of processors for specific tasks. For example, a video sensor node may need a 
more powerful processor than a common temperature sensor node. A small embedded 
operating system such as Berkeley’s TinyOS [2] is another key element of an 
embedded system. Besides the basic ability for process management and resource 
management, it may also provide software tailoring and real-time management, 
as well as support for embedded middleware, network protocols 
and embedded databases. 

The transceiver connects the sensor node to the network. Usually each of the sen- 
sor nodes has the capability to transmit data to and receive data from another node 
and the sink. The latter may further communicate with the task manager via the Internet 
(or satellite), so that information reaches the end user. A transceiver is the most 
power-consuming component of the node. Thus the study of multi-hop communications and 
complex power-saving modes of operation, e.g. having multiple different sleep states, 
is crucial in this context. 
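
To make the benefit of sleep states concrete, the following sketch computes the time-weighted average power of a transceiver and the resulting battery lifetime. All figures (power per state, time fractions, battery capacity) are hypothetical values chosen for the arithmetic only; they do not describe any particular node.

```python
def average_power_mw(states):
    """states: list of (power in mW, fraction of time); fractions must sum to 1."""
    total_fraction = sum(fraction for _, fraction in states)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(power * fraction for power, fraction in states)


def lifetime_hours(battery_mwh, states):
    """Battery lifetime under the given mix of operating states."""
    return battery_mwh / average_power_mw(states)


# Hypothetical numbers: always-on radio versus a duty-cycled radio that is
# active 1% of the time, idle 9% of the time and in deep sleep otherwise.
always_on = [(60.0, 1.0)]
duty_cycled = [(60.0, 0.01), (3.0, 0.09), (0.03, 0.90)]

always_on_hours = lifetime_hours(6000.0, always_on)
duty_cycled_hours = lifetime_hours(6000.0, duty_cycled)
```

Under these assumed figures the average power drops from 60 mW to below 1 mW, which is why the choice and scheduling of sleep states dominates node lifetime.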

The power unit delivers power to all the working parts of the node. Because of the 
limited capacity of the power unit, e.g. the limited lifetime of a battery, the develop- 
ment of the power unit itself and the design of a power saving working mode of the 
sensor network remain some of the most important technical issues. For some appli- 
cations, a solar battery may be used. 

Additionally, a sensor node may have application-dependent functional subunits 
such as a location finder, a mobilizer, a power generator and other special-purpose 
sensors. The nature or number of such subunits may vary, depending on the needs of 
the application. This remains an interesting area for further exploration. 



2 Current Research Activities at ICTCAS 

Research on sensor node architecture is currently under way at ICTCAS. Our goal is 
to build multifunctional sensing nodes targeting applications such as intelligent 
transportation systems, precision agriculture, remote medical care, and public safety 
monitoring and notification. 

Efforts are being made to tackle the following technical issues: 

1) The development of new types of chemical sensors for the detection or monitoring 
of explosives using new sensing principles and materials, e.g. functional polymers 
and nanomaterials; the design of special-purpose sensor nodes, such as video sensors, 
monitoring sensors, and intrusion sensors; 

2) The development of an embedded operating system for an ultra-low-power sensor 
node. Besides the basic ability for process control and resource management, it 
will also provide software tailoring and real-time management, as well as 
support for embedded middleware, network protocols and embedded 
databases; 

3) The design of sensor node hardware and system integration. Here low power 
consumption and high reliability are our main concerns; 

4) Systematic hardware and software co-design to manage low power consumption 
and high reliability. 

Besides the research projects at the basic level of sensor node architecture 
development mentioned above, other projects at the network communication level are 
also under way at ICTCAS, including the study of multi-hop self-contained sensor 
network architecture, the development of related communication protocols, middleware 
and algorithms, and complex power-saving modes of operation. 

Abstract. Middleware support is a major topic in pervasive computing. Exist- 
ing studies mainly address the issues in the organization of and the collabora- 
tion amongst devices and services, but pay little attention to the design support 
of context-aware pervasive applications. Most of these applications are required 
to be adaptable to dynamic environments and self-managed. However, most 
context-aware pervasive applications nowadays have to carry out tedious tasks 
of gathering, classifying and processing messy context information due to lack 
of the necessary middleware support. To address this problem, we propose a 
novel approach based on ontology technology, and apply it in our Cabot project. 
Our approach defines a context ontology tailored to the pervasive computing 
environment. The ontology acts as the context information agreement 
amongst all computing components to support applications with flexible context 
gathering and classifying capabilities. This allows a domain ontology database 
to be constructed for storing the semantic relationships of concepts used in the 
pervasive computing environment. The ontology database supports applications 
with rich context processing capabilities. With the aid of ontology technology, 
Cabot further helps alleviate the impact of the naming problem, and supports 
advanced user space switching. A case study is given to show how Cabot assists 
developers in designing context-aware pervasive applications. 



1 Introduction 

A pervasive computing environment encompasses a spectrum of computation and 
communication devices that seamlessly augment human thoughts and activities [1]. 
Due to the non-trivial context management inherent in pervasive computing, a suitable 
software infrastructure is needed to assist the development of context-aware pervasive 
applications. We refer to the circumstances or situations in which a computation task 
takes place as the context of that task. Most context-aware pervasive applications 
are required to be adaptable to highly dynamic environments and self-managed. 
Therefore, the design of such applications is a challenging research issue. 

At present, developers of context-aware pervasive applications need to write tedious 
and repetitive code to handle context management, which concerns the following 
three functions: 

• Context gathering: Gather proper context information from relevant context 
sources in a flexible way rather than specifying them explicitly. When an applica- 
tion is interested in object movement, the middleware should be able to select 
proper sensors to collect context information about object movement. 

• Context classifying: Classify context information into different categories in an 
application-specific way. An application may hope to analyze a certain scenario 
where the subject is “human being”, the action is “enter” and the area is “office 
4208”. The common context classification is only based on context type (e.g. 
sound, location, temperature, etc.), which cannot meet such requirements. 

• Context processing: Support applications with stronger context processing capa- 
bilities, e.g. context reasoning (knowing “car” is a subclass of “vehicle” helps an 
application interested in vehicle movement collect context information about cars) 
and context filtering (filtering certain context information for privacy purposes). 

Existing studies on middleware support mainly address the issues in the organization 
of and the collaboration amongst devices and services in the pervasive computing 
environment, but pay little attention to the design support of context-aware pervasive 
applications. None of the proposed middleware infrastructures, such as Gaia [1], 
EasyLiving [2], i-Land [3], Aura [4] and Interactive Workspaces [5], can effectively 
assist application developers in handling all the above tasks. 

Other studies focusing on context-awareness [7] [8] [9] mainly analyze some useful 
features of context information and propose some helpful frameworks, yet still 
leave the context processing duties to clients. 

In this paper, we propose a novel approach based on ontology technology, and apply 
it in our Cabot project. Three important concepts, namely context ontology, context 
pattern and context matching, will be defined. Users use context patterns to subscribe 
to the context information they are interested in, while the middleware uses these 
context patterns to execute context matching for users. Context patterns help implement 
flexible context gathering and classifying, and also contribute to enhancing applications 
with stronger context processing capabilities. 

The remainder of this paper is organized as follows: Sec. 2 introduces related work 
in recent years; Sec. 3 presents the Cabot project, a software infrastructure supporting 
context-aware pervasive applications built on ontology technology; Sec. 4 discusses 
further issues relevant to Cabot; Sec. 5 is a case study; and the last section 
summarizes our contributions and explores future work. 



2 Related Work 

Existing studies on context-awareness are mostly concerned with either the frame- 
works that support the abstraction of context information or the context models that 
support data queries. Typical works include the Cooltown project [7], the Sentient 
Computing project [8] and the Owl context service [9]. Their proposed context models 
generally lack formal bases; some of them even ignore the temporal aspects of context 
information. 

Published research projects on middleware support for pervasive computing include 
Gaia, EasyLiving, i-Land, Aura and Interactive Workspaces. 

Gaia is a middleware project focusing on general-purpose pervasive environments. 
It makes use of active spaces [1] to encapsulate all low-level devices and services to 
provide a uniform interface such that developers can utilize and control the pervasive 
computing environment more easily. Aura is similar to Gaia, but uses a different 
approach. Aura has a context observer to monitor environmental changes that would 
trigger Aura to perform pre-defined actions. Each environment is managed by a 
distinct Aura system, and multiple Aura systems can cooperate to perform tasks. 

i-Land works in a special environment that consists of a DynaWall, an InteracTable 
and a CommChair [3]. DynaWall is a wall-size touch screen, while InteracTable 
is a display on a table. CommChair is a chair with computer network support. 
All devices can interact with each other and serve presentations and discussions. 
Interactive Workspaces is another project sharing the same objectives as i-Land. It 
mainly focuses on the collaboration between a PDA and large screen projectors. 

EasyLiving is a computer-centric system focusing on the living environment. A 
typical living environment has projectors, wireless keyboards, mice, finger-print rec- 
ognizers, cameras, etc. Cameras can capture events in the house, and the images will 
be used for recognizing people and tracing their locations. 

These projects work on the management of computing resources, while Cabot focuses 
on how to flexibly gather and classify context information and perform further 
processing, including context reasoning and context filtering. 



3 Cabot System Architecture 

From Cabot’s point of view, a complete pervasive computing environment is composed 
of an Application Layer, a Middleware Layer and a Sensor Layer (Fig. 1). 



Fig. 1. The Cabot system architecture (Application Layer, Middleware Layer, Sensor Layer) 

Context-aware pervasive applications run at the Application Layer. This layer has 
complete client support in terms of APIs. Applications can use context pattern APIs to 
manage (subscribe, update or remove) their own context patterns. Other APIs include 
user space APIs and privacy APIs, which are related to user space management and 
privacy services respectively. An application framework is provided for application 
development. Usually, users do not have to pay attention to the details of communication 
with the middleware. They only need to focus on application logic, that is, to make 
clear which context information they are interested in and how to handle it. 

The Middleware Layer is the kernel part. This layer implements five fundamental 
functionalities: (a) application management, which is in charge of all registered 
applications; (b) context pattern management, which is responsible for context pattern 
manipulations; (c) context pattern matching, which is invoked automatically when the 
middleware receives any incoming context information; (d) context semantics reasoning, 
which infers the semantic relations between concepts; and (e) third-party service 
management, which allows the middleware to integrate external context filtering 
services (e.g. privacy services) such that further context processing can be facilitated. 
The privacy services currently provided can modify or hide certain 
kinds of context information based on user identities and relevant privacy policies. 
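
The privacy service described above can be pictured with the following minimal sketch, which either hides a context or blanks selected fields depending on who requests it. The policy format, the field names and the `apply_privacy` function are assumptions made for illustration and are not part of the Cabot API.

```python
def apply_privacy(context, requester, policies):
    """Return a filtered copy of the context, or None when it must be hidden."""
    for policy in policies:
        applies = policy["subject"] == context.get("subject")
        if applies and requester not in policy["trusted"]:
            if policy["effect"] == "hide":
                return None
            # effect == "modify": blank out only the sensitive fields
            return {
                key: ("undisclosed" if key in policy["fields"] else value)
                for key, value in context.items()
            }
    # No policy restricts this requester: pass the context through unchanged.
    return dict(context)


# Hypothetical policies keyed by the subject of the context.
policies = [
    {"subject": "Tom", "trusted": {"Prof. Cheung"}, "effect": "modify", "fields": {"area"}},
    {"subject": "Cindy", "trusted": set(), "effect": "hide", "fields": set()},
]
context = {"subject": "Tom", "action": "enter", "area": "office 4208"}

seen_by_supervisor = apply_privacy(context, "Prof. Cheung", policies)
seen_by_stranger = apply_privacy(context, "Peter", policies)
```

Because the filter runs in the middleware, applications receive only the view of a context that the policies allow for their user.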

A concept related to the Sensor Layer is active entity. Active entities can be physi- 
cal devices, software components or human beings. They periodically or non- 
periodically send context information to the middleware. Physical devices collect 
sensed context information (e.g., Tom enters office 4208); software components 
generate derived context information (e.g., Cindy is busy); and human beings supply 
profiled context information (e.g., Cedric is supervised by Prof. Cheung). We regard 
each “qualified” active entity as a sensor agent. By “qualified”, we mean that each 
active entity can exchange context information with the middleware based on a pre- 
defined context ontology. 



4 Main Cabot Features 

4.1 Context Ontology and Context Pattern 

Most middleware infrastructures have limitations in supporting applications to flexibly 
subscribe to context information. Usually, context subscription is based on context 
type. This may be inconvenient when users want to gather the context information 
mentioned in Sec. 1. Due to the lack of the necessary support, users have to gather all 
relevant context information and analyze it by themselves. This increases the network 
traffic in context transmission and the analysis workload in context processing. 

Our approach is based on ontology technology. We propose a context ontology, an 
ontology document tailored to the pervasive computing environment. The context 
ontology acts as the context information agreement to which all applications, sensor 
agents and the middleware should conform in pervasive computing. Fig. 2 illustrates 
some major concepts (classes) and relations (properties) in the context ontology. 

An environment context is defined by instantiating each ontology concept. When 
only some of the ontology concepts are instantiated, the result is called a context 
pattern. Applications subscribe to the environment contexts they are interested in 
by means of context patterns submitted to the middleware. 




Fig. 2. The context ontology 



4.2 Context Matching and Concept Semantics Reasoning 

Cabot performs context matching between received environment contexts and subscribed 
context patterns. In practice, both are transmitted, stored and processed as XML 
documents, so an efficient tool for managing XML documents and an expressive 
language for describing matching rules are needed. We utilize xlinkit to 
perform context matching. It is a software framework for checking the consistency of 
distributed XML documents. It comprises a rule language based on First Order Logic 
(FOL) and XPath notation [6]. For each incoming environment context, xlinkit checks 
whether it can be matched against any context pattern stored in the pattern repository 
according to pre-defined matching rules. The matching rules are written like this: 

<forall var="context" in="/Context"> 
  <not><exists var="pattern" in="/Repository/hasPattern/Pattern"> 
    ... 
  </exists></not> 
</forall> 

The omitted part is the kernel matching criteria, which can be classified into three 
modes: exact matching mode, equivalent matching mode and plug-in matching mode. 

If a match is recognized only when a concept has exactly the same 
value in both the environment context and the context pattern, it is called exact 
matching mode. In the equivalent matching mode, the semantics relation between 
two concepts is identified to check equivalence. For example, when “weather” and 
“climate” or “enter” and “come into” appear in pairs, a match is recognized. The 
plug-in matching mode further allows a context pattern to cover richer context 
information. When a more specific concept (say “car”) encounters a more general 
concept (say “vehicle”), this mode accepts it. The context matching example in Fig. 3 
adopts all three matching modes. 



[Figure: an environment context and a context pattern shown side by side as XML documents. Legible content: the environment context carries the time 2004-2-28 22:0:8, the site “office” 4208 and objectId 87654321; the pattern sets every time field and objectId to -1 and names the same site, “office” 4208; an action value of “come into” also appears. Annotations mark the site as matched in exact matching mode and the event fields as matched in plug-in and equivalent matching modes.] 

Fig. 3. A context matching example 

When all concepts between an environment context and a context pattern are 
matched, Cabot asserts this environment context to be “qualified” for this context 
pattern. 

The built-in comparison operators of xlinkit are not sufficient for supporting context 
matching, so a special operator for concept semantics reasoning is required in the 
Cabot implementation. This operator acts as the interface to a concept semantics 
reasoning subsystem built on a pervasive computing domain ontology database. 

The domain ontology database stores much knowledge on semantics relations be- 
tween concepts used in the pervasive computing environment. For example, 
“weather” is similar to “climate”, and “car” is a subclass of “vehicle”. Based on the 
domain ontology, the reasoning subsystem infers the semantics relation between two 
concepts as equivalent, subsumed, including, intersecting or disjoint. 

The inferred semantics relation is the foundation of context matching. Let a concept 
in the environment context be C1, and its counterpart in the context pattern be C2: 

• Exact matching: C1 and C2 are said to be matched when they are exactly the same; 

• Equivalent matching: C1 and C2 are said to be matched when they are exactly the 
same, or have an equivalent relation; 

• Plug-in matching: C1 and C2 are said to be matched when they are exactly the 
same, or have an equivalent or subsumed relation. 
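
The three matching modes can be sketched as follows. This is an illustration only: the miniature ontology, the dictionary representation of contexts and patterns, and all function names are assumptions, whereas Cabot itself works on XML documents through xlinkit and a domain ontology database.

```python
# Hypothetical miniature domain ontology.
EQUIVALENT = {frozenset({"weather", "climate"}), frozenset({"enter", "come into"})}
SUBCLASS_OF = {"car": "vehicle", "PC": "computer"}


def is_equivalent(c1, c2):
    """True when the concepts are identical or recorded as equivalent."""
    return c1 == c2 or frozenset({c1, c2}) in EQUIVALENT


def is_subsumed(c1, c2):
    """True when c1 is a direct or transitive subclass of c2."""
    parent = SUBCLASS_OF.get(c1)
    while parent is not None:
        if is_equivalent(parent, c2):
            return True
        parent = SUBCLASS_OF.get(parent)
    return False


def concept_matches(c1, c2, mode):
    """c1 comes from the environment context, c2 from the context pattern."""
    if mode == "exact":
        return c1 == c2
    if mode == "equivalent":
        return is_equivalent(c1, c2)
    if mode == "plug-in":
        return is_equivalent(c1, c2) or is_subsumed(c1, c2)
    raise ValueError(f"unknown matching mode: {mode}")


def context_matches(context, pattern, modes):
    """A context qualifies when every concept named in the pattern matches."""
    return all(
        name in context and concept_matches(context[name], value, modes[name])
        for name, value in pattern.items()
    )


context = {"area": "office 4208", "subject": "car", "action": "come into"}
pattern = {"area": "office 4208", "subject": "vehicle", "action": "enter"}
modes = {"area": "exact", "subject": "plug-in", "action": "equivalent"}
qualified = context_matches(context, pattern, modes)
```

Note that plug-in matching is directional: a context about a “car” satisfies a pattern about a “vehicle”, but not the other way round.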

Some knowledge of concept semantics relations (e.g., “desk” is similar to “table”) 
helps implement some special tasks (e.g., monitoring the abnormal movement of 
table-like things). Another use of ontology reasoning is to alleviate the naming 
problem across different sensor agents. For example, knowing that “light” is 
similar to “lighting/ray/beam”, a light-detecting application can behave better when 
facing different naming standards. To give applications a particular 
reasoning capability, Cabot needs to incorporate the corresponding ontology 
related to the targeted application scenario. 

4.3 User Space Switching and Application Framework 

Available resources in pervasive computing are prone to change, which can affect 
applications unexpectedly. Cabot allows switching of user spaces to help applications 
adapt themselves to the changing environment. Each user space represents a space 
that contains context information relevant to the context patterns of this user space. 

Cabot also provides a default application framework. This framework utilizes the 
Cabot APIs to set up an asynchronous and context-driven programming model that 
adopts context subscription and callback handling technology. 
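
A context-driven model based on subscription and callbacks might look like the sketch below. The `Middleware` class and its `subscribe` and `publish` methods are hypothetical names, not the Cabot APIs; for brevity the sketch uses exact field comparison and delivers callbacks synchronously, whereas a real framework would dispatch them asynchronously.

```python
class Middleware:
    """Toy stand-in for the middleware: stores patterns and dispatches contexts."""

    def __init__(self):
        self._subscriptions = []  # list of (pattern, callback) pairs

    def subscribe(self, pattern, callback):
        """Register a context pattern together with its handler."""
        self._subscriptions.append((pattern, callback))

    def publish(self, context):
        """Invoke every callback whose pattern fields all equal the context's."""
        delivered = 0
        for pattern, callback in self._subscriptions:
            if all(context.get(key) == value for key, value in pattern.items()):
                callback(context)
                delivered += 1
        return delivered


log = []
middleware = Middleware()
# The application states what it is interested in and how to handle it.
middleware.subscribe(
    {"area": "Gate", "action": "enter"},
    lambda ctx: log.append(ctx["subject"]),
)
first = middleware.publish({"area": "Gate", "action": "enter", "subject": "Tom"})
second = middleware.publish({"area": "Room A", "action": "enter", "subject": "Cindy"})
```

The application code contains no polling or sensor access; it reacts only when a matching context arrives.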



5 A Case Study 



Fig. 4 illustrates a computing environment. Room A is a printing room. Room B is a 
computer barn, and Room C is another computer barn. Any user going to Room B or 
Room C will pass the Gate first. 

An administrator, Peter, who is responsible for equipment maintenance, usually stays in 
Room A, supplying printer paper when necessary and monitoring incoming users. 
Sometimes he goes to Room B and Room C to check whether everything is going well. 

Suppose that temperature and sound context information is required to evaluate the 
PC status in Room B, while for Room C additional humidity and light context 
information is also needed. Peter hopes to know the current equipment status once he 
enters Room B or Room C, and whichever room he is in, continuous monitoring 
of printers and incoming users is expected. 

[Figure: layout showing PC1 to PC7 and the user spaces, with a legend for the temperature, sound, humidity, light, printer status and movement sensors and for locations.] 

Fig. 4. A practical case 

We assume that all required sensors have been installed properly (Fig. 4). The fol- 
lowing is the application design solution that comprises three user spaces (Fig. 4): 

• Space 1: (Gate + Room A) Activating condition: Peter leaves Room B or Room C. 
Context patterns: (1) printer (area: Room A, subject: printer), and (2) people (area: 
Gate, category: movement, subject: people, action: enter). 

• Space 2: (Gate + Room A + Room B) Activating condition: Peter enters Room B. 
Extra context patterns (to Space 1): (1) temperature (area: Room B, category: 
temperature, subject: PC), and (2) sound (area: Room B, category: sound, subject: PC). 

• Space 3: (Gate + Room A + Room C) Activating condition: Peter enters Room C. 
Extra context patterns (to Space 1): (1) temperature (area: Room C, category: 
temperature, subject: computer), (2) sound (area: Room C, category: sound, subject: 
computer), (3) humidity (area: Room C, category: humidity, subject: air), and (4) 
brightness (area: Room C, category: light, subject: fluorescent lamp). 

Cabot supports this application with two distinct capabilities: (1) context reasoning 
(e.g. “someone comes into ...” = “somebody enters ...”); (2) context subscription 
with plug-in matching (e.g. “computer” = “PC” + “workstation” + “mainframe” in 
Space 3). 
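
The three user spaces of this case study can be written down as data, with the active space chosen from Peter's location. The sketch below is only a restatement of the design above in executable form; the dictionary layout and the function names are assumptions.

```python
# Context patterns of Space 1, which stay active in every user space.
BASE_PATTERNS = [
    {"area": "Room A", "subject": "printer"},
    {"area": "Gate", "category": "movement", "subject": "people", "action": "enter"},
]

# Extra context patterns added when Peter enters Room B or Room C.
EXTRA_PATTERNS = {
    "Room B": [
        {"area": "Room B", "category": "temperature", "subject": "PC"},
        {"area": "Room B", "category": "sound", "subject": "PC"},
    ],
    "Room C": [
        {"area": "Room C", "category": "temperature", "subject": "computer"},
        {"area": "Room C", "category": "sound", "subject": "computer"},
        {"area": "Room C", "category": "humidity", "subject": "air"},
        {"area": "Room C", "category": "light", "subject": "fluorescent lamp"},
    ],
}


def active_space(peter_location):
    """Select the user space from the activating conditions."""
    return {"Room B": "Space 2", "Room C": "Space 3"}.get(peter_location, "Space 1")


def active_patterns(peter_location):
    """Patterns subscribed while Peter is at the given location."""
    return BASE_PATTERNS + EXTRA_PATTERNS.get(peter_location, [])
```

Switching user spaces then amounts to replacing the set of subscribed context patterns whenever the activating condition of another space is met.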



6 Conclusions and Future Work 

In this paper, we have reviewed several existing middleware infrastructures for 
pervasive computing. Their support for context management is inadequate. To address 
this problem, we have developed Cabot with the use of ontology technology. 

A useful concept, context pattern, is introduced into Cabot to facilitate the context 
gathering, classifying and processing. In order to alleviate the naming problem and to 
enhance the expressiveness of context patterns, Cabot supports three flexible context 
matching modes. Cabot also allows the automatic and manual switching between user 
spaces to help realize adaptable context-aware pervasive applications. 

At present, Cabot is still at a prototype stage. New functionalities and features (e.g. 
context trigger and context deriving) will be incorporated into the future releases of 
Cabot. 

Abstract. Wireless sensor networks (WSNs) have a wide range of useful, data- 
centric applications, and major techniques involved in these applications in- 
clude in-network query processing and query-informed routing. Both tech- 
niques require realistic environments and detailed system feedback for devel- 
opment and evaluation. Unfortunately, neither real sensor networks nor existing 
simulators/emulators are suitable for this requirement. In this design paper, we 
propose a distributed sensor network emulator, a Virtual Mote Network 
(VMNet), to meet this requirement. We describe the system architecture, the 
synchronization of the nodes and the virtual time emulation with a focus on 
mechanisms that are effective for accurate emulation. 



1. Introduction 

Wireless sensor networks (WSNs) enable applications to obtain up-to-date informa- 
tion about the physical world. This information is especially valuable for environ- 
ments in which it is inefficient, difficult or dangerous for people to collect data on site 
by themselves. However, such environments also make it hard to study techniques 
for data-centric WSN applications in real sensor networks. Furthermore, major tech- 
niques in data-centric WSNs, such as in-network query processing [8][11] and query- 
informed routing [3], need a realistic development and evaluation environment with 
system feedback at a suitable level of detail. In this design paper, we propose to de- 
velop an accurate sensor network emulator in order to facilitate studies of techniques 
for data-centric WSN applications. 

Traditionally, simulators and emulators are useful tools for networking research in 
that they simulate or emulate real networking protocols and provide a controllable 
environment for studies. This usefulness is even greater for WSN applications, because 
real WSNs are upgraded frequently and their deployment is tightly embedded 
in the physical environment. As evidence, several sensor network simulators and 
emulators [9] have been developed for large-scale WSN studies. 

If we focus on developing and debugging WSN applications in a realistic environment, 
existing tools such as TOSSIM [7] and EmStar [4] are excellent choices. However, 
with the advance of data-centric WSN applications, such as environmental 
monitoring and assisted living, more requirements for simulation and emulation are 
posed for developing and evaluating techniques in these applications. For instance, 
in-network query processing and query-informed routing, two major cross-layer tech- 
niques for data-centric WSN applications, require the WSN to return information 
about sensor node power consumption and response time in order to make decisions 
for network routing and query processing. Evaluation of alternatives of each tech- 
nique also requires this information for performance comparison. Unfortunately, 
current WSN simulators/emulators are insufficient to address this need. Specifically, 
an accurate emulation of timing and power consumption for node execution and 
communication is missing in current WSN simulators and emulators. 

Aiming at accurate emulation of a WSN for data-centric applications, we propose a 
WSN emulator called a Virtual Mote Network (VMNet). A VMNet consists of vir- 
tual sensor nodes connected through a virtual channel. Each virtual sensor node in 
turn consists of an emulated CPU as well as emulated hardware peripherals (e.g., 
sensing units and radio frequency units). The emulated CPU executes software that 
can run on real sensor nodes and reports execution time at the granularity of the emu- 
lated CPU cycle. The emulated hardware peripherals generate interrupts with realistic 
delays. The virtual channel is emulated through UDP (User Datagram Protocol) on 
networked PCs with emulated bit errors, delays, and packet collision. Putting all 
these units together, the timing information of the software under study (e.g., an in- 
network query processor or a query-informed router) can be accurately emulated and 
be fed back for the execution and evaluation of the software. 

The remainder of this paper is structured as follows. Section 2 introduces the 
background of our work. Section 3 presents the design of VMNet, including the ar- 
chitecture and components. Section 4 discusses related work briefly and Section 5 
concludes. 



2. Background 

2.1. Terminology 

The following terms are used throughout the paper: 

Node and Mote: both refer to a sensor node consisting of computation, sensing, 
and communication units. The two terms are used interchangeably. 

Real (Target) vs. Virtual (Emulated): A real or target component is one in a real 
WSN and its counterpart in VMNet is virtual or emulated. For instance, a real CPU is 
in a real sensor node and a virtual CPU in a virtual mote. Similarly, we refer to the 
execution time of real software being emulated as the virtual time (not the time of 
executing the emulation itself). 

2.2. Overview of a Target WSN 

Fig. 1 shows a typical WSN. The sensor mote in the WSN is the MICA2 by Crossbow 
[2]. We choose the MICA2 as the target because it is the most commonly used mote. 




Fig. 1. A Typical WSN 

The WSN in Fig. 1 consists of several components, each of which performs differ- 
ent functions. Table 1 lists these components and their composition. Each sensor 
node in the WSN runs application software (e.g., a query processor) developed using 
TinyOS [3]. The sink node acts as the root of the WSN and communicates with the 
data proxy. The data proxy in turn communicates with the user client. Note that the 
data proxy and the user client can be on the same PC. 



Table 1. Components of a WSN 

Components             Composition 
Sensor mote (MICA2)    A main board (MPR410) with the Atmega 128, 8-bit, 7.3827 MHz CPU 
                       and the Chipcon CC1000, 38 kb/s, CSMA radio circuit, and a 
                       sensor board (MTS300) 
Sink node              A main board (MPR410) and a PC interface card with a serial port 
Data proxy             A PC that communicates with the sink node via a serial port 
User client            A PC that runs the user interface program 



The WSN operates as follows. The user posts commands and queries through the user 
client. These queries are parsed by the data proxy and are disseminated via the sink 
node to the network. If a mote acquires data that satisfy a query, it sends the sensory 
data tuples to the data proxy through the sink node. The data proxy forwards the 
result to the user client. 



3. VMNet Design 

Our VMNet is designed via a divide-and-conquer approach. First, we analyze the 
target WSN, and divide the WSN into components. Second, we design the architec- 
ture of the emulator based on the architecture of the real WSN. Third, we design each 
emulated component based on its counterpart in the real WSN. 

3.1. VMNet Architecture 

The architecture of VMNet (Fig. 2) resembles that of a real WSN. It consists of the 
virtual sink node (VM 0) and other virtual motes connected through virtual channels. 
Real application software runs on the virtual motes for sensing, processing, and 
routing. Emulated radio signals travel on the virtual channels. Additionally, the 
Application User Interface (AUI) and the Network Manager (NM) reside on VM 0 for 
application management (corresponding to the data proxy and the user client in the 
real WSN) and network emulation management, respectively. 




Fig. 2. VMNet architecture 



VMNet is designed to be a distributed system in order to achieve fast emulation, 
high accuracy and scalability. It employs a wired Local Area Network (LAN) to 
emulate the wireless network. Each real mote in the target WSN is emulated by a 
program running on the LAN. From our past experience [13], this parallel architec- 
ture that VMNet adopts has shown high fidelity and scalability in emulating general 
wireless networks. 

In brief, the architecture of VMNet abstracts the common features of WSNs. 
Although VMNet can emulate only one type of WSN at a time, the generality of its 
architecture makes it easy to switch to other WSNs. 



3.2. Virtual Mote 

A virtual mote (VM) has three components: the virtual hardware, the real software 
and the Emulation Manager (EM). Fig. 3 shows the structure of a VM. 

Fig. 3. Structure of a VM 

The virtual hardware is the emulation of a mote’s hardware (MICA2 in this paper). 
It is composed of the following units: a virtual CPU with a virtual clock, a Virtual 
Radio Frequency Module (VRFM) and a virtual sensor board. In the virtual sink 
node, there is a virtual UART, which emulates the serial port of the sink node. The 
virtual hardware units are the same type as their corresponding real hardware. For 
instance, the virtual CPU emulates the Atmega 128 CPU of the MICA2 mote. 

The virtual CPU and the virtual clock in a VM are critical for the accuracy of 
emulation, because they control execution and timing. The virtual CPU parses the 
executable binary codes of a mote and executes them. It also interacts with other 
virtual hardware units via the virtual I/O ports. The virtual clock is incremented by 
the virtual CPU per mote CPU clock cycle and records the virtual time in a VM. 

The emulation manager manages mote emulation and logs the emulated actions, 
the execution time of various modes and runtime status of these units. 

VM 0 (the virtual sink) is different from other VMs in that it has the virtual UART 
but no virtual sensor board. The difference is emulated by the gateway software, 
which operates on the virtual UART and disregards the virtual sensor board. 

All three components (the virtual hardware, the real application software and the 
emulation manager) are separate from one another in a VM. There is a clear interface 
between the three components. This design ensures the reusability of our emulator 
when the target application or hardware changes. It also provides a reasonable 
solution to the conflict between the generality and the accuracy (specificity) of 
emulation. 



3.3. Virtual Channel 

The virtual channel generates network effects using three software modules: the bit 
error module, the delay module and the collision module (shown in Fig. 4). Let us 
first describe the transmission process of data on a virtual channel connected with a 
VM. When outgoing bits are sent from the Virtual Radio Frequency Module (VRFM) 
of the VM to the virtual channel, they pass through the three modules and stay in a 
buffer (in the lower right corner of Fig. 4) for wrapping. When all bits of a packet 
arrive in the buffer, the virtual channel wraps them into a packet and sends out the 
packet via UDP. When an incoming UDP packet arrives at the virtual channel, it is 
put into a queue (lower left of Fig. 4) and is decomposed into bits to be sent to the 
VRFM of the VM via another buffer (on the left of Fig. 4). 




Fig. 4. Virtual Channel 



The bit error module uses an empirical radio signal error model to generate the 
error rate. The bit error rate model is a table with two attributes: distance and bit 
error rate, where the bit error rate is defined as (number of error bits received by the 
receiver) / (number of total bits sent by the sender). The module randomly generates 
bit errors at the rate that the table specifies. 
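
A table-driven bit error module of this kind might be sketched as follows. This is a minimal illustration, not VMNet code: the table values, the rule of using the largest tabulated distance not exceeding the actual distance, and the names `BER_TABLE`, `lookup_ber` and `apply_bit_errors` are all assumptions made for the example.

```python
import bisect
import random

# Hypothetical (distance in meters, bit error rate) table; the values are made up.
BER_TABLE = [(0.0, 0.0), (10.0, 0.001), (20.0, 0.01), (30.0, 0.1)]

def lookup_ber(distance, table=BER_TABLE):
    """Return the BER of the largest tabulated distance <= distance (assumed rule)."""
    distances = [d for d, _ in table]
    idx = bisect.bisect_right(distances, distance) - 1
    return table[max(idx, 0)][1]

def apply_bit_errors(bits, distance, rng):
    """Flip each bit independently with the probability given by the table."""
    ber = lookup_ber(distance)
    return [b ^ 1 if rng.random() < ber else b for b in bits]
```

Passing the random generator explicitly (for example `random.Random(42)`) keeps an emulation run repeatable, which is useful when comparing alternative techniques on the same error pattern.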

The transmission delay module adds a delay to the virtual time of the outgoing 
packet. The collision module emulates radio signal collision by performing two op- 
erations: carrier sense and collision. Both operations need information about the 
virtual time and the data transmission status of all VMs. This information is kept in 
the Network Manager. 

In the carrier sense operation, the collision module asks the network manager 
whether a sending VM can hear any VMs that are transmitting data. If so, the 
sending VM will wait a random time defined by the network protocols. In the collision 
operation, the collision module destroys the current bit to be sent on either of two 
conditions: (1) another VM is transmitting and the sender of the bit can hear that 
transmitting VM, or (2) another VM is sending to the same destination as this sender. 
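
The carrier sense test and the two collision conditions can be written down compactly. The sketch below is only an illustration of the rules as stated; the data structures (`transmitting` as a dictionary from a transmitting VM to its destination, `can_hear` as a set of listener/talker pairs) and the function names are hypothetical stand-ins for the status information kept in the network manager.

```python
def carrier_busy(sender, transmitting, can_hear):
    """Carrier sense: True if the sender can hear any other transmitting VM."""
    return any(other != sender and (sender, other) in can_hear
               for other in transmitting)

def must_destroy_bit(sender, dest, transmitting, can_hear):
    """Collision: apply conditions (1) and (2) to one outgoing bit."""
    for other, other_dest in transmitting.items():
        if other == sender:
            continue
        if (sender, other) in can_hear:   # (1) sender hears another transmitter
            return True
        if other_dest == dest:            # (2) another VM sends to the same destination
            return True
    return False
```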



3.4. Virtual Time 

The major criterion for the accuracy of our emulation is the emulated time, or the 
virtual time. Mathematical models are one way to estimate time, but it is hard to 
achieve a high accuracy with simple models. In VMNet, we follow the approach of 
real execution. That is, the emulator executes the real software and measures the 
virtual time of the execution. 

We have described the timing mechanism in the virtual CPU with a virtual clock in 
Section 3.2. Moreover, the working time of hardware peripherals, such as sensing 
time and transmitting time, is also emulated. Let us take the virtual sensor board as 
an example. When the virtual sensor board receives a command from the virtual CPU, 
it checks the virtual clock and, after a delay (whose length is based on measurements 
in real systems), sends an interrupt signal to the interrupt port connected 
to the virtual CPU. When the virtual CPU receives the interrupt signal, it executes 
the sensor data interrupt service routine. Therefore, the virtual time together with 
the interrupt is accurately emulated. The timing and the interrupts of the RFM and 
other hardware peripherals are emulated similarly. 

Since the sleep mode of real motes is important for power efficiency, accurate 
emulation of a WSN should consider the sleep mode. After the virtual CPU executes 
the “sleep” instruction, it sleeps until there is a timer interrupt. The VM 
advances its virtual clock by the sleep time, reports its status to the network manager 
and waits for synchronization. 

Up to this point, we have discussed time emulation for individual VMs. Because 
VMs run simultaneously, synchronization is needed to ensure that the messages and 
the operations of VMs occur in the same order as in the target WSN. 

The synchronization procedure is as follows. At startup, the network manager 
initializes its table of network status information, including the total number of 
VMs n and the value of the virtual clock of each VM: vt_0, vt_1, ... Whenever the 
VMs have run for a predefined interval T, called the synchronization interval, they 
pause and report to the network manager. After every VM has reported to the 
network manager that its virtual clock has advanced by T, the network manager 
broadcasts a message informing the VMs to resume running. 
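
The bookkeeping on the network manager side amounts to a barrier over the n VMs. The following toy class sketches that idea under simplifying assumptions: reports arrive as plain method calls rather than network messages, and the class and method names are invented for the example.

```python
class NetworkManager:
    """Toy stand-in for the network manager's synchronization table."""

    def __init__(self, n, interval):
        self.interval = interval      # synchronization interval T
        self.clocks = [0] * n         # virtual clock of each VM
        self.reported = set()         # VMs that have paused in this round

    def report(self, vm_id, virtual_time):
        """Record a paused VM's clock; return True when all VMs may resume."""
        self.clocks[vm_id] = virtual_time
        self.reported.add(vm_id)
        if len(self.reported) == len(self.clocks):
            self.reported.clear()     # the resume broadcast starts the next round
            return True
        return False
```

In a distributed deployment the `True` result would correspond to sending the broadcast message; here it is simply returned to the last VM that reports.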

It is possible that, while the fastest VMs (with a virtual time vt) are waiting, other 
VMs run past vt before they pause. This does not matter, because the message order 
is not affected and the excess time is absorbed in the next interval. To ensure 
correct ordering of messages, the virtual channel queues received UDP packets, 
sorted by their virtual time in ascending order. This queue avoids the following 
semantic error: when a message is processed, another message that should have been 
processed earlier is found in the buffer. 
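
A queue ordered by virtual time is naturally expressed as a priority queue. The sketch below uses Python's standard `heapq` module; the class name, the `pop_ready` method and the arrival-order tie-breaker are illustrative choices, not details given in the design.

```python
import heapq
import itertools

class VirtualTimeQueue:
    """Incoming packets ordered by their virtual timestamps."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker keeps arrival order

    def push(self, virtual_time, packet):
        heapq.heappush(self._heap, (virtual_time, next(self._counter), packet))

    def pop_ready(self, now):
        """Pop, in ascending virtual time, every packet stamped at or before now."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready
```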

In summary, the virtual time is carefully emulated in VMNet for accuracy. VMNet 
adopts a virtual CPU with a virtual clock for each VM to manage the timing. The 
virtual hardware peripherals of a VM generate interrupts with realistic delays. The 
sleep mode of a VM is considered and is gracefully handled. Synchronization of 
multiple VMs is performed periodically to ensure the correctness of emulation. 



4. Related Work 

Previous work, including GloMoSim [12], Maisie [10] and SWiMNET [1], has shown 
that parallel and distributed architectures can speed up simulations. In this direction, 
our effort on VMNet is an outgrowth of our previous work on EMPOWER [13], a 
distributed wire-line and wireless network emulation framework. Our previous work 
on EMWIN [14] provides experience in emulating wireless networks. 

In the area of sensor network simulation and emulation, UC Berkeley’s TOSSIM 
[7] simulates the network at the bit level. It is useful for debugging applications but it 
has not provided detailed timing information of the target. UCLA’s EmStar [4] [6] is 
another simulator of WSNs. It has not focused on detailed performance evaluation of 
the target WSN yet. 

5. Conclusions and Future Work 

We have presented our design of VMNet, a distributed emulator for WSNs with accu- 
rate system feedback. We are currently implementing VMNet, with a focus on the 
accurate estimation of application execution time. We plan to add power consumption 
emulation as well as mobility emulation to VMNet in the near future. 

Abstract. Location estimation and user behavior recognition are research issues 
that go hand in hand. In the past, these two issues have been investigated separately. 
In this paper, we present an integrated framework called LEAPS (location 
estimation and action prediction), jointly developed by Hong Kong University of 
Science and Technology and the Institute of Computing, Shanghai, of the Chinese 
Academy of Sciences, that combines two areas of interest, namely, location 
estimation and plan recognition, in a coherent whole. Under this framework, we 
have been carrying out several investigations, including action and plan recogni- 
tion from low-level signals and location estimation by intelligently selecting ac- 
cess points (AP). Our two-layered model, including a sensor-level model and an 
action and goal prediction model, allows for future extensions in more advanced 
features and services. 



1 Introduction 

In recent years, the research areas of indoor location estimation and user behavior 
prediction have attracted intense attention. Much work has been done in the computer 
network and pervasive computing areas on using the signal strength values from the access 
points (AP) to determine locations using various geometric and probabilistic knowl- 
edge. Similarly, different statistical models have been proposed in artificial intelligence 
and data mining area for recognizing and predicting a user’s behavior and plans. How- 
ever, no work has attempted to combine location estimation with high-level behavior 
recognition, and to use machine learning and probabilistic reasoning to help with low- 
level location estimation. In this paper, we survey the work in location estimation and 
high-level plan recognition, and present our own integrated framework for accomplish- 
ing both tasks. Through our integrated system known as LEAPS (Location Estimation 
and Action Prediction System), we present our research results in three different tasks, 
including using machine learning methods for access point selection and using location 
estimation for high-level goal recognition. 

Our two-layered model for LEAPS is shown in Figure 1, which includes a sensor 
model, and an action and goal recognition model. This framework allows several dif- 
ferent tasks to be accomplished, ranging from low level to high level. It also allows for 
more advanced extensions in the future. Below, we discuss the different layers in turn. 



[Figure 1 labels: Goal, Action, State, Signal Strengths; Action Model, Sensor Model] 

Fig. 1. Two level model for the LEAPS framework 



2 Wireless Environment for LEAPS 

Our experimental test bed is set up in the faculty office area of the Computer Science 
Department in the Academic Building of Hong Kong University of Science and 
Technology. The building is equipped with an IEEE 802.11b wireless Ethernet network 
in the 2.4 GHz frequency band. The layout of the floor is shown in Figure 2. Experiments 
were carried out in the five hallways (HW1 to HW5) and two printing rooms, as labeled 
in the figure. 

There are a total of 25 base stations that are detectable in the environment, of which 
three base stations that are distributed within the area are marked with solid circles 
in the figure. Among the other 22 base stations, some are located on the same floor 
outside of the area while the others are located on different floors. The IEEE 802.11b 
standard works over the radio frequencies in the 2.4 GHz band. However, accurate 
location estimation using measurements of signal strength is a longstanding, complex and 
difficult task due to the noisy characteristics of signal propagation. Subject to 
reflection, refraction, diffraction and absorption by structures and even human bodies, 
signal propagation suffers from severe multi-path fading effects in an indoor environment. 
As a result, a transmitted signal can reach the receiver through different paths, each 
having its own amplitude and phase. Figure 3 gives a typical example of the normalized 
histogram of the signal strength received from a base station at a fixed location. 

Fig. 2. The layout of the HKUST office area for the Wireless LAN experiments 



3 Indoor Location Estimation at the Sensor Level 

Deterministic Techniques In general, the location estimation research can be classified 
into two categories: deterministic techniques and probabilistic techniques. Determinis- 
tic techniques [2, 8] use deterministic inference methods to estimate a user’s location. 
The RADAR system by Microsoft Research [2] proposes nearest neighbor heuristics 
and triangulation methods to infer a user’s location. It maintains a radio map which 
tabulates the signal strength received from different access points at selected locations. 
Each signal-strength measurement is then compared against the radio map and the coor- 
dinates of the best matches are averaged to give the location estimate. The accuracy of 
RADAR is about three meters with fifty percent probability. The LANDMARC system 
[8] exploits the idea of reference points to alleviate the effects caused by the fluctuation 
of RFID signal strength. The accuracy is roughly one to three meters. However, the 
placement of reference tags should be carefully designed since it has a significant effect 
on the performance of the system. Moreover, RFID readers are so expensive that using 
them for localization in a large area is infeasible. 

Probabilistic Framework Another branch of research is the probabilistic techniques 
[14, 13, 11, 6], which construct a conditional probability distribution over locations in 
the environment of interest. In [6], Ladd et al. use probabilistic inference methods for 
localization. They first use Bayesian inference to compute the conditional probability 
over locations, based on received signal-strength measurements from nine access points in 
the environment. Then a postprocessing step, which utilizes the spatial constraints of a 
user’s movement trajectories, is used to refine the location estimate and reject the results 
with significant change in the location space. With and without the postprocessing step, 
the accuracy of this method is 83% and 77%, respectively, within 1.5 meters. 
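
To make the Bayesian step concrete, the sketch below computes a posterior over candidate locations from one signal-strength observation. It is not the method of [6]: the Gaussian likelihood per access point, the assumption that access points are conditionally independent given the location, the uniform default prior, and all names and numbers are assumptions chosen for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def posterior_over_locations(observation, radio_map, prior=None):
    """observation: {ap: rss}; radio_map: {location: {ap: (mean, std)}}."""
    locations = list(radio_map)
    if prior is None:
        prior = {loc: 1.0 / len(locations) for loc in locations}
    scores = {}
    for loc in locations:
        likelihood = 1.0
        for ap, rss in observation.items():
            mean, std = radio_map[loc][ap]
            likelihood *= gaussian_pdf(rss, mean, std)   # independence assumption
        scores[loc] = prior[loc] * likelihood
    total = sum(scores.values())
    return {loc: score / total for loc, score in scores.items()}
```

The most probable location is then `max(posterior, key=posterior.get)`; a postprocessing step such as the trajectory constraint described above would operate on a sequence of these posteriors.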
Fig. 3. An example of signal strength distribution 



The LEAPS Framework to Location Estimation Our LEAPS framework for location 
estimation is divided into two phases. The first phase is done offline, where the main 
purpose is to perform intelligent AP selection. We divide this phase into the following 
steps: 

1. First, a feature selection algorithm is applied to find a subset S of AP’s that can 
give the best performance. This subset will then be used as the basis for subsequent 
computation. 

2. A subsequent clustering analysis is then applied to the set S and data collected in 
the offline phase, in order to partition the grid space into clusters. Each cluster will 
then provide a subsequent location model. 

3. Finally, a decision tree model is constructed for each cluster, based on the AP’s 
given in S. For each cluster only a subset of AP’s from S is selected, which further 
reduces the number of AP’s needed for location estimation within each cluster. 

The second phase is an online phase, in which a new trace of signal strengths is taken 
as input and the current location is estimated. This phase is done in two steps: 

1. First, the signal strength values from the selected AP’s in the set S are used to 
determine the cluster in which the current client is most likely located. 

2. Then, the decision tree from the identified cluster is used to determine, at a finer 
level, which grid the client belongs to. This step uses a subset of the AP’s given 
in S, which further reduces the number of AP’s used in a computation. In addition, 
evaluating the decision tree involves only arithmetic comparisons on the selected 
AP’s, which are among the cheapest computations in terms of energy. 

As one of our experimental results, Figure 4 shows the relation between different 
AP selection methods and accuracy. To achieve an accuracy of 90%, the number of 
access points required for each location estimate is 12, 18 and 24 for information 
gain (InfoGain), the MLE-based selection used in the previous work of [14] (MaxMean), 
and AP selection in the opposite order of the information gain ordering (MinMean), 
respectively. The InfoGain criterion thus uses the fewest access points to achieve 
the same level of accuracy, a clear improvement over previous methods. 
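
Ranking access points by information gain can be illustrated with a few lines of code. The sketch below assumes that signal strengths have already been discretized into levels and that each training sample pairs a grid location with one reading per access point; the function names and the toy data layout are hypothetical, and the actual feature selection procedure may differ.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(samples, ap):
    """samples: list of (location, {ap: discretized signal level})."""
    locations = [loc for loc, _ in samples]
    by_value = defaultdict(list)
    for loc, readings in samples:
        by_value[readings[ap]].append(loc)
    conditional = sum(len(group) / len(samples) * entropy(group)
                      for group in by_value.values())
    return entropy(locations) - conditional

def rank_aps(samples, aps):
    """Order access points from most to least informative about the location."""
    return sorted(aps, key=lambda ap: info_gain(samples, ap), reverse=True)
```

Selecting the first few entries of `rank_aps` yields a subset of access points whose readings best separate the locations in the training data.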




[Figure 4 axis label: Number of AP] 

Fig. 4. AP selection and its impact on accuracy 



4 User Behavior Recognition in LEAPS 

Being able to predict a user’s location is one thing, but the eventual purpose of location 
estimation is to infer users’ high-level behavior patterns from low-level sensory data, 
and provide useful services. Such a task is what we call location-based plan recognition. 
Being able to accomplish this task is critical to many applications. For people suffering 
from various cognitive limitations in hospitals and care facilities, the technique can 
discover when a person’s behavior is out of the norm and provide help in a timely 
manner [9]. For shoppers in a busy business environment such as a shopping mall, 
services and products can be offered not only according to people’s current location, 
but also according to their intended goals and actions. 

In our current work, we have taken a first step in inferring high-level user goals 
from low-level mobile data in an indoor environment where a wireless LAN is 
available. In this section, we summarize our statistical model for goal recognition; a 
full report can be found in [12]. The model is a two-level architecture that integrates 
a dynamic Bayesian network (DBN) with a fast-inferencing model based 
on n-grams for goal inference. We show that this architecture allows us to incorporate 
domain knowledge, whereby we achieve a good tradeoff between model accuracy and 
inferencing efficiency. 

Previous Work on Plan Recognition In the artificial intelligence area, recognizing com- 
plex high-level behaviors has traditionally been the focus of plan recognition [5, 7]. A 
Bayesian network was used for plan recognition in story understanding [4]. In [3], a 
corpus-based N-gram model was introduced to predict the goal from a given sequence 
of command actions in the UNIX domain. In addition, other advanced stochastic mod- 
els for recognizing high-level behaviors were proposed such as Dynamic Bayesian Net- 
works (DBN) [1] and Probabilistic State Dependent Grammars [10]. However, most 
of the work in plan recognition has been restricted to the high level for inference, and 
the challenge of dealing with low-level sensor models has not been addressed. Only 
in recent years have attempts been made to integrate high-level behavior models with 
low-level sensor models. The work of [9] presents an approach by applying a Bayesian 
model to predict a user’s transportation mode based on location readings from GPS 
devices in an urban environment. 

The Dynamic Bayesian Network Model To represent different degrees of uncertainty in 
a time series, researchers have proposed various probabilistic models. Among them, the 
dynamic Bayesian network model has been shown to be well suited for goal recognition 
tasks. It includes two time slices numbered $t$ and $t-1$ respectively. The shaded nodes 
$SS_{i,t}$ represent the strength variables of signals received from multiple base stations, 
which are directly observable. All the other variables (the physical location $S_t$ of the 
user, the action $A_t$ the user is taking and the goal $G_t$ the user is pursuing) are hidden, 
with the values to be inferred from the raw signals. 

To balance between recognition accuracy and computational complexity, we pro- 
pose a novel two-level model. The complete DBN model is separated into two different 
models. The lower part of the DBN starting from the observation layer to the action 
layer corresponds to the lower level of the architecture. An N-gram model is applied 
above the action model to infer goals. 

In this framework, a low-level DBN model is responsible for computing the most 
plausible sequence of actions $A_1, A_2, \ldots, A_t$ from the observations 
$o_1, o_2, \ldots, o_t$ obtained up to time $t$. The task is carried out by the method of 
smoothing, which estimates the hidden states of the past given all the evidence up to the 
current time point $t$, $P(A_\tau \mid o_1, o_2, \ldots, o_t)$, where $\tau < t$. To test 
the validity of the model, we have collected 570 traces for 19 goals of a professor to be 
modeled in the office area. Figure 5 shows the recognition process of one trace belonging 
to the goal “Seminar-in-Room1”, with respect to three other goals among the 19 goals. 



5 Conclusions and Future Work 

In this paper, we presented an integrated approach towards addressing the problem of 
inferring high-level goals from low-level noisy signals, as well as estimating users’ 
current locations, in a complex indoor environment using an RF-based wireless LAN. 

Fig. 5. Behavior recognition for four goals in our experiment 



For location estimation, we applied machine learning techniques to select an optimal 
subset of access points that allow the best balance in accuracy and power saving. For 
the task of action prediction and plan recognition, we model the problem based on the 
framework of dynamic Bayesian network, using a two-level n-gram based model. The 
experiments demonstrate that the two-level approach is more efficient than the whole 
DBN solution while the accuracy is comparable. The three-layered model, including the 
sensor model, the action model and the goal model, provides an integrated framework 
for location estimation and high-level goal recognition. 

Abstract. In wireless sensor networks (WSN) the reliability of the system 
can be increased by providing several paths from the source node 
to the destination node and by sending the same packet through each 
of them (the algorithm is known as multipath routing). Using this technique, 
the traffic increases significantly. In this paper, we analyze the 
combination of a new multipath routing mechanism and a data-splitting 
scheme that results in an efficient solution for achieving high 
delivery ratios while keeping the traffic at a low value. Simulation results 
are presented in order to characterize the performance of the algorithm. 



1 Introduction 

Sensor nodes have many failure modes [4], each one of them decreasing the 
performance of the network. The algorithms presented in this paper ensure 
that the gathered data reach their destination in the network by treating the 
unavailability of nodes during the routing procedure as a normal occurrence. 
Additional energy is required only for a small amount of computation; this 
is almost negligible compared with the energy used for communications [5]. 

The algorithm starts by discovering n multiple paths from the source to the 
destination. Sending the same data over all discovered paths is a solution in case 
of node failures, but it requires large quantities of network resources (such as 
bandwidth and energy). Our contribution is a new multipath routing 
algorithm that discovers several disjoint paths between a source node and a 
destination node. We then use Turbo Erasure Correction codes to split 
the original data packet into k parts (further referred to as subpackets) and 
compute n-k redundant subpackets. Finally, we send these n subpackets, instead of the 
whole packet, across the n paths. The basic principle is to transmit a sequence 
of n subpackets, out of which only k subpackets are necessary to reconstruct the 
original packet. The receiver’s robustness to missing packets is increased, which 
also implies that a return feedback channel is no longer needed. 
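
The k-out-of-n principle can be illustrated with the simplest possible erasure code, a single XOR parity subpacket (n = k + 1). This is deliberately not a Turbo Erasure Correction code, which tolerates more losses; the sketch below only shows how a packet can be rebuilt when one of the n subpackets never arrives. Function names and the zero-padding scheme are assumptions for the example.

```python
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def split_with_parity(packet, k):
    """Split packet into k equal subpackets plus one XOR parity subpacket."""
    size = -(-len(packet) // k)                  # ceiling division
    padded = packet.ljust(size * k, b"\0")       # pad so all parts have equal size
    parts = [padded[i * size:(i + 1) * size] for i in range(k)]
    return parts + [reduce(xor_bytes, parts)]

def reconstruct(received, k, length):
    """received: list of n = k + 1 subpackets with at most one set to None."""
    missing = [i for i, part in enumerate(received) if part is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost subpacket")
    parts = list(received)
    if missing:
        present = [p for p in parts if p is not None]
        parts[missing[0]] = reduce(xor_bytes, present)   # XOR of the rest
    return b"".join(parts[:k])[:length]
```

The receiver needs to know the original packet length to strip the padding; in this sketch it is passed explicitly as `length`.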

This work is performed as a part of the European EYES project (IST-2001- 
34734) on self-organizing and collaborative energy-efficient sensor networks [2]. 
It addresses the convergence of distributed information processing, wireless 
communication and mobile computing. 

2 Multipath On-Demand Routing 

Multipath Routing allows the establishment of multiple disjoint paths between 
source and destination, which provides an easy mechanism to increase the likeli- 
hood of reliable data delivery by sending multiple copies of data along different 
paths. Based on Dynamic Source Routing (DSR) [3], we designed a new multipath 
routing algorithm, the Multipath On-Demand Algorithm (further referred to 
as MDR). The algorithm provides several paths from sources to destinations. A 
data splitting algorithm as presented in Section 3 will be used to safely route 
data while keeping the amount of traffic low. The two phases of the algorithm 
are described below. 



2.1 Route Request Phase 

The source initiates the Route Request phase by sending a request message to 
notify the destination that it has a packet for it. The route request message 
contains the following fields: 

— snodeID the source node ID

— dnodeID the destination node ID

— floodID the route request message ID 

— lasthop the ID of the node forwarding this message 

— ack the ID of the last hop 

Each node in the network has a unique ID and each message source maintains
a counter of the requests sent, such that each route request message in the
network is uniquely identified by the first three fields. The ack field is needed
to distinguish between the messages received by a node. In this way, a route
request message can be immediately classified as being received for the first time,
or as being just a passive acknowledgement of a previously sent message.

When a source node has to transmit a message to a destination, it first
checks its cache to see if there are any routes to that destination that have not
expired. If the number of routes found is large enough for the maximum given
failing probability of the nodes in the network, it uses them. If not, it generates
a new route request message, filling the ack field with its own ID. When receiving
such a message, a node checks its local data structure to see if it has received
another route request message with the same first three fields. If not, it creates
a new entry in the data structure and stores this information plus the ID of the
node from which it received the message. For additional copies received, the node
has to store only the ID of the neighbor. It can easily check and mark whether the
sender of the message is a first order neighbor by looking at the lasthop or ack
fields.

The node will forward only the first route request message it gets. It has to
replace only the ack field with the lasthop value and the lasthop field with its
own ID. After receiving several such messages, each node knows which nodes are its
neighbors and, more than that, which ones are closer to the source (further
referred to as the n-1 neighbor list) and which ones are closer to the destination
(further referred to as the n+1 neighbor list). Each entry of the data structure
stores them in two separate lists according to this rule. If the node identifies
itself as being the destination of the message, it initiates the second phase of
the algorithm.
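
The Route Request handling described above can be sketched as follows. This is
only an illustrative sketch, not the authors' implementation: the class and field
names are hypothetical, and the rule used to sort neighbors into the two lists
(a passive acknowledgement marks a neighbor as closer to the destination, any
other overheard copy marks it as closer to the source) is an assumption, since the
text does not spell the rule out.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class RouteRequest:
    snode_id: int  # source node ID
    dnode_id: int  # destination node ID
    flood_id: int  # route request message ID
    lasthop: int   # ID of the node forwarding this message
    ack: int       # ID of the last hop


@dataclass
class FloodEntry:
    n_minus_1: list = field(default_factory=list)  # neighbors closer to the source
    n_plus_1: list = field(default_factory=list)   # neighbors closer to the destination


class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.table = {}  # (snode_id, dnode_id, flood_id) -> FloodEntry

    def on_route_request(self, msg):
        """Process one received request; return the copy to rebroadcast, or None."""
        key = (msg.snode_id, msg.dnode_id, msg.flood_id)
        first_time = key not in self.table
        entry = self.table.setdefault(key, FloodEntry())
        if msg.ack == self.node_id:
            # Passive acknowledgement: the sender rebroadcast our copy.
            # Assumption: such a neighbor is closer to the destination.
            if msg.lasthop not in entry.n_plus_1:
                entry.n_plus_1.append(msg.lasthop)
            return None
        # Assumption: any other sender is treated as closer to the source.
        if msg.lasthop not in entry.n_minus_1:
            entry.n_minus_1.append(msg.lasthop)
        if not first_time or msg.dnode_id == self.node_id:
            # Duplicates are not forwarded; the destination would start
            # the Route Reply phase here instead of forwarding.
            return None
        # Forward only the first copy: ack <- old lasthop, lasthop <- own ID.
        return replace(msg, ack=msg.lasthop, lasthop=self.node_id)
```

Keeping the table keyed by the first three fields mirrors the statement that these
fields uniquely identify a route request in the network.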

2.2 Route Reply Phase 

The Route Reply phase is the part of the algorithm in which several paths
between the destination and the source are reported to the source (if they exist).
Because each intermediate node keeps information about its neighbors, the
complete path between the source and the destination does not have to be stored
inside the route reply message, which contains the following fields:

— snodeID the source node ID

— dnodeID the destination node ID

— floodID the flood message ID

— lasthop the ID of the node forwarding this message 

— nexthop the ID of the node to which the message is forwarded 

— ack the ID of the last hop 

— hops the number of the hops the message traveled through 

— detours the number of detours a message can take 

The nexthop field contains the ID of the node that has to receive this
message. This information is provided by each node from its local data structure.
The hops field is incremented with each hop the message travels and represents
the current path length. The detours field specifies how many times the reply
message is allowed to travel in the opposite direction (from source to
destination).

When the first route reply message arrives at the source, this node stores
the ID of the node that forwarded the message and the path length. It also sets
up a timer to measure the interval that it will wait for other reply messages
to come. When this timer expires, it splits the original data message according
to the number of paths, the maximum probability of failure and the length of
the paths, and forwards it. A node that receives a route reply addressed to it
will modify the last four fields of the message according to the new parameters.
Afterwards, it will forward it to the first neighbor in the n-1 neighbor list. If
this list is empty and the detours field is greater than 0, it chooses the first
neighbor in the n+1 neighbor list and decreases the detours field by 1. A node
that receives a route reply not addressed to it searches its own data structure
to find the entry corresponding to the first three fields. If such an entry is
found, it removes the forwarding node from both the n-1 and n+1 neighbor lists.

A node that forwarded a message has to take care of two more things: first, it
sets a flag in its data structure saying that it will not forward any other
message and second, it waits for the passive acknowledgement. If this does not
arrive, it assumes that the node to which it sent the message is no longer there,
is broken, or has previously forwarded a message, and it deletes that node from
its lists. It will try re-sending the message to the next neighbor in the lists,
until the lists become empty or the detours field becomes 0.

The previous step of removing nodes from the lists is needed to ensure that
the source will receive only disjoint paths. If, for various reasons, the paths
from the destination to the source have to be known, each node that forwards a
route reply message can append its ID to it. This way, the messages will grow in
length, but this growth is controlled and involves only a subset of the nodes.
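
The next-hop choice and the retry on a missing passive acknowledgement can be
sketched as below. The function names and the `is_acknowledged` callback are
hypothetical; the sketch also assumes that a detour is consumed by every attempt
over the n+1 list, even when that attempt is not acknowledged, which the text
leaves open.

```python
def pick_next_hop(n_minus_1, n_plus_1, detours):
    """Return (next_hop, remaining_detours); next_hop is None if no neighbor is usable."""
    if n_minus_1:
        # Prefer a neighbor that is closer to the source.
        return n_minus_1[0], detours
    if n_plus_1 and detours > 0:
        # Otherwise take a detour towards the destination, if still allowed.
        return n_plus_1[0], detours - 1
    return None, detours


def forward_reply(n_minus_1, n_plus_1, detours, is_acknowledged):
    """Try neighbors in order until one passively acknowledges the reply."""
    closer_src, closer_dst = list(n_minus_1), list(n_plus_1)
    while True:
        hop, detours = pick_next_hop(closer_src, closer_dst, detours)
        if hop is None:
            return None, detours  # lists exhausted or no detours left
        if is_acknowledged(hop):
            return hop, detours
        # No passive acknowledgement: delete the neighbor from both lists and retry.
        closer_src = [x for x in closer_src if x != hop]
        closer_dst = [x for x in closer_dst if x != hop]
```

Each failed attempt removes one neighbor, so the loop always terminates.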

3 Data Splitting Across Multiple Paths 

The route discovery process of MDR provides multiple disjoint paths between
source and destination. In this section we try to predict the number of paths that
will succeed in delivering subpackets. Furthermore, we give an approximation
with an additional term that can be used to increase the chance of successful
delivery of the entire message at the cost of added redundancy.

Of course, one can send the whole message along each of the available paths, 
but the overhead induced by this will be too high. The entire data package to be 
sent from the source to the destination over the available n disjoint paths will 
be split up into smaller subpackets of equal size with added redundancy. The 
number of created subpackets corresponds to the number of available paths. Only 
a smaller number of these subpackets will then be needed at the destination to 
reconstruct the original message. In the following, we will focus on approximating 
a value k that gives, with high probability, the number of successful paths. This 
value will then be used to determine the amount of redundancy to be added for 
the split message transmission in Section 4. The total number of subpackets as 
well as the added redundancy is a function dependent on the multipath degree 
and on the failing probabilities of the available paths. As these values change 
according to the positions of the source and the destination in the network, each 
source must be able to decide on the parameters for the error correcting codes 
before the transmission of the actual data subpackets. 

Suppose we want to send a data package from a source to a destination and
the process of MDR has finished with n different paths whose reputation
coefficient is sufficient. Each path has some rate $p_i$ ($i = 1, \ldots, n$) that
corresponds to the probability of successfully delivering a message to the
destination. This setting corresponds to a repeated Bernoulli experiment. So the
estimated number k of successful paths is given by

$k = \sum_{i=1}^{n} p_i$

When estimating the number of successful paths k, we consider the probability
of a packet failure to be independent of the packet size. The effect of packet
size on the failure probability depends on many different factors in the system.
One large packet is better for MAC layer and OS scheduling, but has a higher
chance of failure during physical transmission. Many small packets put more burden
on MAC layer and OS scheduling, while each of them has a better chance of success
in the physical layer. So in this paper, we do not consider the influence of
packet size on the failure probability. Following the previous work in [1], we
give an estimate for the number of successful paths with a given value α that
serves as a desired bound on the delivery. Given the delivery probabilities $p_i$
of the n disjoint paths in the multipath routing, the (maximum) number k of paths
that are successful in 1-α of all deliveries can be expressed as

$k = x_\alpha \sqrt{\sum_{i=1}^{n} p_i (1 - p_i)} + \sum_{i=1}^{n} p_i$

$x_\alpha$ is the corresponding bound from the standard normal distribution for
different levels of α (see Table 1 for some values). The estimated value for k
in the previous expression corresponds to a bound of α = 50%. In this case, the
coefficient $x_\alpha$ is 0, and we obtain the same estimate as before.



α       5%      10%     15%     20%     50%
x_α     -1.65   -1.28   -1.03   -0.85   0

Table 1. Some values for the bound α
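
As a small worked example of the two estimates, the sketch below evaluates them
for a given list of path success probabilities. It assumes the normal
approximation $k = x_\alpha \sqrt{\sum p_i (1 - p_i)} + \sum p_i$ and uses the
values of Table 1; the function names are hypothetical.

```python
import math

# Bounds x_alpha of the standard normal distribution, as listed in Table 1.
X_ALPHA = {0.05: -1.65, 0.10: -1.28, 0.15: -1.03, 0.20: -0.85, 0.50: 0.0}


def expected_successful_paths(p):
    """Expected number of successful paths: the sum of the success probabilities."""
    return sum(p)


def bounded_successful_paths(p, alpha):
    """Number of paths that succeed in 1 - alpha of all deliveries (normal approximation)."""
    mean = sum(p)
    std = math.sqrt(sum(pi * (1.0 - pi) for pi in p))
    return X_ALPHA[alpha] * std + mean


paths = [0.9] * 5                              # five paths, each succeeding with probability 0.9
k_mean = expected_successful_paths(paths)      # about 4.5
k_95 = bounded_successful_paths(paths, 0.05)   # about 3.39
k_usable = math.floor(k_95)                    # whole subpackets one can rely on
```

With five paths of success probability 0.9, the expected number of successful
paths is 4.5, while the 5% bound suggests relying on only 3 of the 5 subpackets.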



4 Turbo Erasure Correction 

The design of the error correction meets the requirements of our split-multipath
scheme. The Turbo Erasure Correction code (TEC) described in this section is
based on the well-known Reed-Solomon error correction code (RSC).

RSC codes are linear block codes, which are often denoted RS(n, k) with s-bit
symbols. The encoder takes k data symbols and adds check symbols to make
an n symbol codeword. RSC codes correct up to t errors in a codeword, where
$2t = n - k$. For a symbol size s, the maximum codeword length is $n = 2^s - 1$.
Because RSC codes correct symbol errors, they can potentially correct many
bit errors. This makes RSC codes very good at correcting large clusters of errors.
Moreover, if the position of an error is known (such an error is called an
erasure), then the decoding procedure can correct up to 2t erasures. This means
that RSC can correct as many erasures as redundant symbols were added. In TEC,
when a data packet arrives, it is divided into k subpackets of L bits each. These
subpackets are then placed in a two-dimensional array of k x L bits as shown in
Figure 1. Further, let L = L' x s, so that every s bits form a symbol in the finite
field GF($2^s$). The encoding is carried out in two stages. The outer codes are
Reed-Solomon codes over GF($2^s$), which protect against subpacket losses. Each
column of information symbols in GF($2^s$) is encoded into a code word of
$C_o(n, k)$, where the number of redundant symbols is R = n - k. In total there
are L' outer code words in this array. Then a header h is added to each row, which
keeps the index of each subpacket and the amount of padding added. The inner
encoding is optional and gives extra reliability against link errors. A binary BCH
code could be used for each row as an inner correction code. After the TEC
encoding, each of the n rows (subpackets) is sent on a different one of the n
paths established by the multipath routing algorithm. As long as at least k of
them are received at the destination node, the TEC is able to reconstruct the
original packet.



Fig. 1. Turbo Erasure Correction Code (diagram labels: packet, header)



The desired characteristics of the TEC are summarized below: 

— BURST CORRECTION: Errors occur because of link failures. Normally a
whole subpacket is lost rather than individual bits.

— ERASURE CORRECTION: The index in the subpacket header helps to locate
the missing subpackets during decoding, so that errors become erasures.

— ADAPTABILITY: The multipath degree and the link quality vary over a wide
range in a short period of time. TEC can adapt to the changes quickly.
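
To illustrate the k-out-of-n property of the outer code, the following sketch
encodes one column of k symbols into n symbols and recovers the data from any k
surviving symbols. It is not the TEC implementation: for brevity it swaps in a
polynomial code over the prime field GF(257), decoded by Lagrange interpolation,
instead of a Reed-Solomon code over GF($2^s$), and it omits headers, padding and
the inner BCH code. All names are hypothetical.

```python
P = 257  # prime modulus; a simple stand-in for GF(2^s), requires n <= 257


def _interpolate_at(points, x):
    """Evaluate at x the unique polynomial through the (xi, yi) points, modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode_column(data, n):
    """Systematic encoding: the k data symbols are kept, n - k redundant ones are added."""
    k = len(data)
    points = list(enumerate(data))  # data symbol i is the polynomial value at x = i
    return list(data) + [_interpolate_at(points, x) for x in range(k, n)]


def decode_column(received, k):
    """Recover the k data symbols from any k received (index, symbol) pairs."""
    if len(received) < k:
        raise ValueError("need at least k subpackets")
    points = received[:k]
    return [_interpolate_at(points, x) for x in range(k)]


data = [10, 200, 33]            # k = 3 symbols of one column
code = encode_column(data, 5)   # n = 5 symbols, one per path
# Paths 0 and 2 fail; the subpacket index plays the role of the header index.
survivors = [(1, code[1]), (3, code[3]), (4, code[4])]
recovered = decode_column(survivors, 3)
```

Because the index of every surviving symbol is known, losses are erasures, and any
k of the n symbols determine the degree k-1 polynomial and hence the data.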



5 Simulation and Results 

We have implemented a simulation of the MDR algorithm in order to quantify
the amount of overhead it introduces versus the improvements obtained. The
simulations were performed using our own mobility framework designed for the
OMNeT++ simulator.

We considered 50 nodes randomly distributed inside a rectangular surface
(500 by 800 units). For the movement of nodes we used the Random Way Point
algorithm. The average sleep time of a node was chosen to be 5.5 seconds. We
assume that all the links in the network are bidirectional. Sets of up to 10
simulations were performed for different combinations of the average speed of
nodes and transmission ranges. Usually, the speeds were taken from the interval
2 - 20 units/second and the transmission range from the interval 100 - 325 units.
One of the nodes was defined as the source node and randomly chose a destination
every 0.5 seconds to forward a message to. Each simulation had a limit of 200
seconds (in fact, after 200 seconds the source stopped generating requests and
the simulations ran until all the messages were exhausted in the network).

The main parameters considered were the number of messages, the amount
of traffic generated, the latency introduced and the reliability of the algorithm.
For comparison, we also implemented DSR with path caching and route maintenance
enabled. We ran both DSR and MDR for several network configurations, with
identical parameters and identical sequences of destinations in both cases. The
results are presented in Figure 2.




Fig. 2. Comparison MDR/DSR 



The figure shows that the number of overhead messages is higher for the 
MDR algorithm. A closer look at the message sizes shows that the MDR traffic 
compared to the DSR traffic varies from a 4.04:1 to a 1.02:1 ratio (from the lower 
average speed to the higher one). 








After the paths are created, the source will deliver one data packet that takes
one round to travel through one hop. In this case, the latency of DSR is smaller
with low mobility. With the increase of speed, the situation changes. In practice
we assume data packets to be far larger than control messages. As future work we
are going to investigate the latency from this point of view as well. The last
graph in Figure 2 shows the average number of failed cases for the two algorithms.
MDR has considerably fewer failed cases than DSR. The figure clearly shows that
the algorithm meets its two objectives: it improves the reliability substantially
and it makes the network almost immune to higher average node speeds.

A failed case is the situation in which the source had a data message to 
deliver to the destination but failed reaching it. There are two reasons for it: 

— the route discovery mechanism did not return any valid paths between the 
source and the destination; 

— although there were several paths available, the data packets got lost on the 
way (due to mobility issues). 

This higher reliability of MDR compared with DSR is the advantage for which we
pay with a higher number of control messages and higher latency.

6 Conclusions 

This paper introduced a split multipath scheme to improve the reliability of
data routing in wireless sensor networks while keeping the traffic at a low level.
An on-demand multipath routing algorithm offers the data source several paths
to any destination (if available). It is used in combination with a data splitting
method based on Turbo Erasure Coding.

We have implemented this scheme and estimated its main characteristics.
It greatly increases the reliability of packet delivery in wireless sensor
networks, while keeping the total network traffic much lower than traditional
multipath routing. At the same time, the latency of split multipath routing is
shorter than that of any retransmission scheme.

Future work will focus on integrating path estimation into MDR, so that
the failing probabilities of each node can be obtained in the routing process.
We also have in mind modifying the Route Reply phase to better deal with
failures. This will also allow caching of routes. The effect of caching the
routes and maintaining them has still to be determined. Our scheme, although
focused on WSNs, can be incorporated into any routing scheme to improve
reliable packet delivery in the face of a dynamic (wireless) environment where
nodes move and connections break.

Abstract. In this paper, we study the reliability issue in aggregating data
versions for execution of real-time queries in a wireless sensor network in
which sensor nodes are distributed to monitor the events occurring in the
environment. We extend the Parallel Data Shipping with Priority Transmission
(PAST) scheme to be workload sensitive (the new algorithm is called PAST with
Workload Sensitivity (PAST-WS)) in selecting the coordinator node and the
paths for transmitting the data from the participating nodes to the coordinator
node. PAST-WS considers the workload at each relay node to minimize the
total cost and delay in data transmission. PAST-WS not only reduces the data
aggregation cost significantly, but also distributes the aggregation workload
more evenly among the nodes in the system. Both properties are very important
for extending the lifetime of sensor networks since the energy consumption rate
of the nodes highly depends on the data transmission workloads.



1 Introduction 

In this paper, we study the use of the in-network processing approach [MFH03] for
processing real-time queries which access sensor databases maintained by
sensor nodes distributed in the system, in order to generate timely responses if
certain events are detected or emergency situations occur [SKH03, YG03]. If the
communication workload is concentrated on some nodes, not only will the energy
consumption rate of these nodes be high, but the message loss problem will also be
very serious due to the high collision probability in data transmission. It is
quite common that the sampled value for a data item contains errors due to noise.
Thus, the result generated from a query may contain errors too if the data items
accessed by the query contain errors. In [LPSL04], a parallel data shipping
scheme, called PAST, is proposed to gather the right versions of data items for a
real-time query at a coordinator node using the time-stamping method, so that they
are relatively consistent, with reduced data transmission cost. In this paper, we
extend PAST by considering the data communication workload among the relay nodes
in choosing the path for collecting sensor data for execution of a query. Our
objective is to satisfy the constraints of the queries and at the same time to
minimize the communication overhead and improve the reliability of data
communication by evenly distributing the communication workload in the system.



2 System Model 

A wireless sensor system consists of a base station (BS) and a collection of sensor
nodes distributed in the system environment, which is divided into a number of
square grids with side length r as shown in Figure 1. It is assumed that the nodes
within the same grid capture the same signals of their surrounding environment. The
length r of a grid is defined such that a node can directly communicate with all
the nodes in its neighboring grids. Each sensor node generates sensor data values
following a pre-defined sampling period, which is defined based on the dynamic
property of the sampled entities. A real-time query $T_i$ can formally be defined
as a tuple: $(D_i, Op_i, <_i, O_i, A_i, R_i)$. $Op_i$ is the set of read operations,
each of which accesses a sensor data item ($O_i$). To simplify the discussion, it is
assumed that the required data items of a query are defined at the grid level. The
set of operations $Op_i$ in a real-time query is associated with precedence
constraints ($<_i$) on their execution orders. Due to the responsive nature of a
real-time query and the dynamic nature of the system environment, it is important
that the values of the data items accessed by a real-time query represent the
current information (“real-time status of the entities”) in the environment. Each
real-time query has a currency requirement ($A_i$) on its set of data items.
Failing to meet the requirement implies that they are too “old” and do not
correctly describe the current situation. Since a timely response is critical to
important events occurring in the environment, each real-time query is given a
deadline ($D_i$) on its completion time. In addition to meeting the deadline and
currency requirement, another important issue is the reliability of the results
generated from a query. As the query result generated from a set of data items may
contain errors, it is important to provide multiple results by accessing multiple
data versions of data items in processing a real-time query to improve the
reliability and accuracy of the results. Therefore, a real-time query is associated
with a result interval ($R_i$), which specifies the time interval of data items for
generating the results.




Figure 1: System Model.

In this paper, we adopt relative consistency as the correct notion for ensuring the
correctness of the results and meeting the currency requirement of a real-time query
[LP04, SBLC03]. Each data version of a data item x is assigned a time-stamp at its
generation time to indicate the start time of the validity of the data version. It
will become invalid when the next version is generated. We use two time bounds, the
upper valid time (UVT) and the lower valid time (LVT), to label the validity
interval of a data version. The set of data items for execution of a real-time
query is relatively consistent if they are temporally correlated to each other,
i.e., they represent the status of entities in the environment at the same time
point.

Relative consistency. Given a set of data versions V from different data items, the
versions in V are relatively consistent if
$\bigcap \{VI(v_i) \mid v_i \in V\} \neq \emptyset$, where
$VI(v_i) = [LVT(v_i), UVT(v_i)]$.

To meet the relative consistency requirement, the deadline constraint and the
currency requirement of query $T_i$, the validity of all its accessed data versions
should not be earlier than $(D_i - A_i)$, where $D_i$ and $A_i$ are the deadline and
the currency requirement of $T_i$, respectively, i.e.
$\left(\bigcap \{VI(v_i) \mid v_i \in V\}\right) \cap [D_i - A_i, D_i] \neq \emptyset$.
The time window ($D_i$ to $(D_i - A_i)$) is called the valid time window for the set
of valid results of the query.
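
The two conditions can be checked mechanically. The sketch below assumes that each
data version is represented by a closed validity interval (LVT, UVT) and that
relative consistency means a non-empty intersection of these intervals, which must
in turn overlap the valid time window from $D_i - A_i$ to $D_i$; the function names
are hypothetical.

```python
def common_validity(intervals):
    """Intersection of closed (LVT, UVT) intervals, or None if it is empty."""
    lo = max(lvt for lvt, _ in intervals)
    hi = min(uvt for _, uvt in intervals)
    return (lo, hi) if lo <= hi else None


def satisfies_query(intervals, deadline, currency):
    """True if the versions are relatively consistent and current enough for the query."""
    common = common_validity(intervals)
    if common is None:
        return False  # not relatively consistent
    lo, hi = common
    # The common validity interval must overlap [deadline - currency, deadline].
    return lo <= deadline and hi >= deadline - currency


versions = [(0, 10), (5, 12), (8, 20)]   # validity intervals of three data versions
window = common_validity(versions)       # all three are valid during (8, 10)
```

For these versions, a query with deadline 15 and currency requirement 6 is
satisfied, whereas one with deadline 20 and currency requirement 5 is not, because
the common validity interval ends before the valid time window begins.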



3 A Parallel Data Shipping Scheme - PAST 

In Parallel Data Shipping with Priority Transmission (PAST) [LPSL04], the
participating nodes of a query submit data versions to a carefully selected
coordinator node in a parallel and synchronized fashion. The submission of data
versions from the participating nodes is synchronized depending on the farthest
participating nodes from the coordinator node.

Once the base station receives a real-time query, it will determine: (1) which
node (grid) should be assigned as the coordinator node such that the total
transmission cost of the data versions from the participating nodes to the
coordinator node is minimized, and (2) which data versions from each participating
node should be sent to the coordinator node. The data transmission delay from a
participating node to the coordinator node is measured in terms of the number of
hops in communication between them, as we assume that the data transmission delay
for sending a data version through one hop is a constant Q. Let
$G_{all} = \{g_1, g_2, g_3, \ldots, g_n\}$ be the set of grids in the system, and
let $F_{ij}$ be the distance (in number of hops) between grid i and grid j, where
$i, j = 1, \ldots, n$. Suppose $T_i$ wants to access u grids/nodes and its required
nodes are in the grid set $G_i = \{g_{i1}, g_{i2}, g_{i3}, \ldots, g_{iu}\}$ with
$u = |G_i|$. Let $F_{total,X}$ be the total transmission length, defined in terms of
hops, for choosing grid X as the grid where the coordinator node resides. Then

$F_{total,X} = \sum_{j \in G_i} F_{Xj}$

Let $D_{max}$ be the maximum data transmission delay of all the participating nodes
of a query, measured in grids using the shortest distance. The set of data versions
to be submitted from each participating node consists of those data versions which
are valid within the interval from $(D_i - C_i - D_{max} - R_i)$ to
$(D_i - C_i - D_{max})$. The maximum number of hops $H_i$ of the participating
nodes from the coordinator node is then $H_i = D_{max} / Q$. Let the coordinates of
a grid k be $(X_k, Y_k)$, and consider the set of grids which can be reached by the
data versions originating from a participating grid within a distance of no more
than $H_i$.

Eqn. (1) defines a square region with a participating node as the central point of
the square and a boundary $H_i$ hop counts from the grid where the participating
node resides. Then, we calculate the coordinates of the coordinator node, which
lies within the intersection of the regions of all the participating nodes, so as
to minimize the total hop count in getting all the data items from the
participating nodes.

Once the coordinator node and the set of data versions for transmission have been
determined, the information together with the result interval requirement $R_i$
will be sent to the participating nodes. The transmission of data versions from the
participating nodes to the coordinator node through the relay nodes is prioritized
so that the arrival time of the data is close to the expected time. The priority of
a data message $M_i$ for query $T_i$ at node $N_j$ is calculated as:
($D_i$ - current time) / number of hops from $N_j$ to the coordinator. A higher
priority is assigned to a data message for transmission if the calculated value is
smaller. The query will be processed at the coordinator node according to the order
of the operations defined in the query and following the relative consistency
requirement.
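
A minimal sketch of the coordinator selection and of the priority value follows.
It assumes that grids are addressed by integer (x, y) coordinates and that one hop
reaches any of the eight neighboring grids, so that the hop distance is the
Chebyshev distance and the reachable region is a square; these choices and all
function names are assumptions of the sketch, not taken from [LPSL04].

```python
def hops(a, b):
    """Hop distance between grids a and b (assumed: Chebyshev distance)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def choose_coordinator(all_grids, participants, max_hops):
    """Grid within max_hops of every participant that minimizes the total hop count."""
    candidates = [g for g in all_grids
                  if all(hops(g, p) <= max_hops for p in participants)]
    if not candidates:
        return None  # the square regions of the participants do not intersect
    return min(candidates, key=lambda g: sum(hops(g, p) for p in participants))


def priority_value(deadline, now, hops_to_coordinator):
    """(deadline - current time) / hops; a smaller value means a higher priority.

    Assumes hops_to_coordinator >= 1, i.e. the sender is not the coordinator itself.
    """
    return (deadline - now) / hops_to_coordinator


grids = [(x, y) for x in range(5) for y in range(5)]   # a 5 x 5 grid system
participants = [(0, 0), (4, 0), (2, 4)]
coordinator = choose_coordinator(grids, participants, max_hops=3)
```

In this example the only grid with a total of 6 hops is (2, 2); every other grid
inside the intersection of the three square regions costs 7 hops.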



4 Workload Sensitive Data Aggregation - PAST-WS 

In this section, we introduce the extension of PAST, PAST with Workload Sensitivity
(PAST-WS), whose purpose is to improve the reliability and to reduce the cost and
delay in data transmission from the participating nodes to the coordinator node
[IEGH02]. Although the total aggregation distance defined in hop counts from the
participating nodes to the coordinator node is minimized in PAST, PAST ignores the
data loss problem in choosing the coordinator node and the aggregation paths.
PAST-WS resolves this problem by using the mean number of data retransmissions,
instead of the physical hop counts, in choosing the coordinator node and the
aggregation paths. Another important benefit of the proposed scheme is that the
data transmission workload will be more evenly distributed among the nodes in the
system. Thus, the energy consumption rate of each node will remain similar over the
network, enabling a longer system lifetime.

4.1 Error Modeling for Data Aggregation in Sensor Networks 

After determining the coordinator node using PAST, the base station will determine
the paths and the start times for the participating nodes to submit their data
versions. Since the grids are square, the shortest path defined in terms of hop
counts to the coordinator node can easily be calculated using a shortest path
searching algorithm, i.e., it is the shortest line connecting the participating
node and the coordinator node. However, this may not be the best one in terms of
the number of communication messages and the total transmission time due to
retransmissions. In particular, if multiple sensor nodes want to send messages to
the same node N at the same time (or within its transmission time), the receiver N
may not be able to receive all of them due to collisions. For the calculation of
the error probability of message loss, the base station maintains an array
indicating the current relay workload of each node in processing real-time queries
in the system. The begin time and the end time of the activated queries are also
recorded. When the end time of a query has passed, the array will be updated
accordingly.

We model the probability of message loss at a node, $p_n$, as a function of the
number of nodes (n) concurrently sending data to it. We assume that the
transmission delay (S) and transmission period (P) are the same for all senders.
For the case n = 2, i.e. there are two senders to the same receiver, the
probability of having a conflict in transmission is $p_2 = S/P$. This can be
observed in Figure 2. If node 2 sends a message during the first conflict time
interval (marked grey in Figure 2), the message from node 2 may be lost since the
receiver is receiving a message from node 1. Within a period, if the message from
node 2 is sent after the first conflict time interval, it will not be lost. So the
message loss probability for the case of having two senders is S/P.



Figure 2: Data Collisions Probability for two senders. (Timeline labels: Node 1,
Node 2; time marks O, S, P.)



In an interval of time P, at most $m = \lfloor P/S \rfloor$ messages can be sent,
i.e. a period can be divided into m small time intervals in each of which only one
message can be sent. Since there are n senders (suppose n < m), there are $m^n$
combinations for choosing message transmission intervals for these nodes. Among
these $m^n$ possibilities, there are

$m(m-1)\cdots(m-n+1) = \frac{m!}{(m-n)!}$

combinations that will not cause any message loss. Thus the probability of message
loss $p_n$ from any node satisfies:

$(1 - p_n)^n = \frac{m!/(m-n)!}{m^n}$

That is:

$p_n = 1 - \left( \frac{m!}{m^n \, (m-n)!} \right)^{1/n}$    eqn. (2).

We can solve equation (2) to get $p_n$ for different values of n. After we do this
for n = 1, 2, ..., m, we get the distribution of loss probability.

To calculate the average transmission cost for a single hop (measured in hop 
counts), we assume that the receiving node is the relay node of k nodes. Then, the loss 
probability of message sending to the receiving node is Pk. So the probability of 
successfully sending data to the receiving node by sending once is 1 - p^ The 
probability that the sender needs to send twice (i.e. the first message sent is lost, and 
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the second one is received) is p^il -pk)- Similarly, the probability that the sender need 
to send K times is (i.e. the first {K - 1) transmissions are all lost, and the final one is 

received) is p{k) = Pk^^^~Pk) ■ So the expected message cost of sending one 
message through a single hop to the receiving node is: 

Chop = TK-p{k)= TK-pf-^il-p,) eqn.(3). 

K=\ K=\ 

The cost of a path Cpath is the sum of the costs of all the hops in the path. When 
the probability of message loss is considered in calculating the propagation cost, the 
message transmission delay of a path is no longer a constant proportional to number 
of hop counts. It is a random variable and larger than that of no message loss case. We 
can estimate the average number of times a message that has to be sent in one single 
hop. In order to calculate the mean delay of a path, we need to estimate the mean time 
length of intervals between consecutive message resend events. The probability 
distribution of the number of successive message loss is the same as the distribution 
of the number of resends, i.e. p{K) discussed in eqn. (2). Thus the average time length 
between two consecutive retransmissions is: 

\sum_{K=1}^{\infty} K \cdot P \cdot p(K) = P \cdot C_{hop}        eqn. (4)

We get the expected delay D_hop of a single hop by multiplying the period P by the
average number of transmissions C_hop, i.e., D_hop = P · C_hop.
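
As a rough numerical check of the per-hop cost, the series for the expected number of transmissions is the mean of a geometric distribution and sums to 1/(1 - p_k). The Python sketch below compares a truncated sum with that closed form and derives the hop delay on the assumption that it is the product of the period P and C_hop. The function names and the `terms` parameter are illustrative and do not come from the paper.

```python
def expected_hop_cost(p_loss, terms=None):
    """Expected number of transmissions over one hop with loss probability p_loss.

    With terms=None the closed form 1 / (1 - p_loss) is used; otherwise the
    series sum_{K=1..terms} K * p_loss**(K-1) * (1 - p_loss) is evaluated.
    """
    if not 0.0 <= p_loss < 1.0:
        raise ValueError("p_loss must be in [0, 1)")
    if terms is None:
        return 1.0 / (1.0 - p_loss)
    return sum(K * p_loss ** (K - 1) * (1.0 - p_loss) for K in range(1, terms + 1))


def expected_hop_delay(p_loss, period):
    """Hop delay under the assumption D_hop = P * C_hop."""
    return period * expected_hop_cost(p_loss)


if __name__ == "__main__":
    for p in (0.0, 0.2, 0.5):
        print(p, expected_hop_cost(p), expected_hop_cost(p, terms=200))
```

The truncated sum converges quickly for moderate loss probabilities, so the closed form is the practical choice when evaluating many candidate paths.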

4.2 Calculating the Aggregation Path and Coordinator Node 

In choosing the path for data propagation, we need to ensure that the expected delay
satisfies the currency requirement, i.e., that (D_i - C_i - D_hop) > 0. Algorithm 1 shows
the steps for finding the coordinator node and the best paths for forwarding the data
versions to the coordinator node. If message loss is considered, the delay is larger
than in the case of no message loss, so the set of possible coordinators under message
loss is a subset of the set for the case with no message loss. In this way, we
exclude most of the impossible candidates for the coordinator node. Assuming straight
paths (the shortest connection path between a participating node and the possible
coordinator node), we then find the coordinator node that satisfies the currency
requirement with the minimum cost. Finally, for each path we examine its feasible
replacements and calculate the cost reduction of each one, and we apply the
replacement with the maximum cost reduction. This final step is repeated until there
is no feasible replacement.



Objective: To find the coordinator node and the paths from the participating nodes of T_i
with total minimum communication cost.

Inputs: The node status of all the participating nodes: n (number of receivers), S (mean
message delay to send a data version) and P (mean data transmission period);
G_i = {G_i1, G_i2, G_i3, ...}

Outputs: The coordinator node and the set of paths from the participating nodes with
minimum communication cost.

Call Algorithm 1 (PAST) to find the set of possible coordinators S;
for each coordinator node c_node in S
{   /* exclude non-candidate coordinators from S */
    for each participating node p_node in G_i {
        path = the straight path from c_node to p_node;
        D = sum of D_hop(H) over every hop H in path;
        if (D > D_i - C_i - R_i) then {
            S = S - {c_node};   /* cannot be a coordinator */
            break;              /* break and continue to check the next c_node */
        }
    }
}
if (S == {}) then abort;        /* no feasible solution */

C_min = infinity;
for each node c_node in S do
{   /* assuming a straight path (the shortest path), find the coordinator node with
       minimum cost */
    C_total = 0;
    for each participating node p_node in G_i
    {   path = the straight path from c_node to p_node;
        C_path = sum of C_hop of each hop of the path;
        C_total = C_total + C_path; }
    if (C_total < C_min) {
        coordinator_node = c_node;
        C_min = C_total; }
}

S_path = the set of straight paths (the shortest paths) from participating nodes to
         coordinator_node;
do
{   /* adjust the paths */
    C_max = 0;
    R_path = NULL;              /* path to be replaced */
    R_max = NULL;               /* path which will replace R_path */
    F = false;
    for each path in S_path
    {   C_R = 0;
        for each replacement r of path {
            if (r satisfies delay constraint AND
                C_path - C_r + sum(C_decrease) - sum(C_increase) > C_R) {
                C_R = C_path - C_r + sum(C_decrease) - sum(C_increase);  /* cost reduction */
                R = r; }
        }
        if (C_R > 0) {
            F = true;
            if (C_R > C_max)
            {   R_max = R;
                C_max = C_R;
                R_path = path; }
        }
    }
    if (F == true)
        S_path = S_path + {R_max} - {R_path};   /* replace the path R_path with R_max */
} while (F == true);
return S_path, coordinator_node;

Algorithm 1: Finding the coordinator node and the paths under message loss.
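
The first two phases of the algorithm (excluding coordinators whose straight paths violate the delay bound, then picking the cheapest remaining candidate) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `select_coordinator`, its callback parameters, the single `delay_bound` value, and the toy topology are all hypothetical, and the path-replacement phase is omitted.

```python
import math


def select_coordinator(candidates, participants, hops_on_path, hop_delay, hop_cost,
                       delay_bound):
    """Return (coordinator, total_cost), or None if no candidate is feasible.

    hops_on_path(c, p) yields the hops of the straight path from c to p;
    hop_delay(h) and hop_cost(h) give D_hop and C_hop for hop h.
    """
    best, best_cost = None, math.inf
    for c in candidates:
        total, feasible = 0.0, True
        for p in participants:
            hops = list(hops_on_path(c, p))
            if sum(hop_delay(h) for h in hops) > delay_bound:
                feasible = False          # c cannot be a coordinator
                break
            total += sum(hop_cost(h) for h in hops)
        if feasible and total < best_cost:
            best, best_cost = c, total
    return None if best is None else (best, best_cost)


# Toy data (hypothetical): loss probability of each relay hop.
LOSS = {"a": 0.0, "b": 0.5, "c": 0.2}
PATHS = {("x", "p1"): ["a"], ("x", "p2"): ["a", "b"],
         ("y", "p1"): ["c"], ("y", "p2"): ["c"]}
PERIOD = 1.0


def toy_cost(hop):
    return 1.0 / (1.0 - LOSS[hop])        # expected transmissions per hop


def toy_delay(hop):
    return PERIOD * toy_cost(hop)


if __name__ == "__main__":
    print(select_coordinator(["x", "y"], ["p1", "p2"],
                             lambda c, p: PATHS[(c, p)],
                             toy_delay, toy_cost, delay_bound=5.0))
```

In the toy data, candidate x has one lossy relay hop, so candidate y is cheaper overall even though none of its hops is loss-free.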
5 Performance Results

Figures 3 through 6 show the results when we vary the size of a real-time query. As
shown in Figure 3, increasing the query size (number of grids) increases the data
transmission workload. Compared with PAST, the data transmission workload of
PAST-WS is consistently lower, as shown in Figures 3 and 4. Figures 5 and 6
show the distribution of the data transmission workload over the nodes in the system.
It can be seen that the workload is more evenly distributed in PAST-WS than in PAST:
the numbers of heavily and moderately loaded grids in PAST-WS are smaller than in
PAST. In addition, we have measured the mean and variance of the workload of the
nodes. Consistent with the results in Figures 5 and 6, both the mean and the variance
for PAST-WS are smaller than those for PAST.




Figure 3: Query size vs. data transmission cost

Figure 4: Percentage improvement of PAST-WS (x-axis: query size, number of grids)



Figures 7 and 8 show the results of PAST-WS and PAST, respectively, when we
vary the currency requirement of a query. We can see that PAST-WS not only gives a
smaller transmission cost, but also completes more queries successfully, i.e., meeting
the deadline, currency and result requirements. In PAST, due to the long aggregation
time and the heavy workload resulting from re-transmissions, a large number of queries
can only be partially completed and some of them even fail, i.e., no results are
generated, especially when the currency requirement is tight. The situation is less
serious in PAST-WS, as shown in Figure 7, because its data transmission workload is
lower once the workloads of the relay nodes are considered in choosing the coordinator
node and the relay nodes. We have also investigated the impact of varying the locality
factor of a query on performance. (Due to space limitations, we do not show the
result figures.) Similar to the results discussed before, PAST-WS shows better
performance.



6 Conclusions 

In this paper, we have studied how to improve the reliability of data aggregation for
the execution of real-time queries in a wireless sensor system. The real-time queries
are associated with a deadline on their completion times, and it is important to
generate the results before the deadlines since the queries are mainly for generating
responses to events that have occurred in the system. To meet the query processing
requirements with minimum data transmission cost, a parallel execution scheme called
PAST was proposed. However, the workload at the relay nodes was not taken into
consideration in selecting the coordinator node and the aggregation paths. If the
workload at the relay nodes is heavy, the data loss probability will be high, and the
consequence is that either some data are lost or many re-transmissions are required.
Frequent re-transmission not only increases the energy consumption rate at the relay
nodes, but also increases the data transmission workload in the system and the delay
in gathering the data versions for processing the queries. In this paper, we extend
PAST to include a workload-sensitive scheme for selecting the coordinator node and the
paths for data aggregation. The new algorithm is called PAST-WS. Simulation results
have shown that PAST-WS can significantly reduce the aggregation workload and delay
and at the same time can distribute the aggregation workload evenly in the system.

Figure 5: Distribution of transmission workload (PAST-WS)

Figure 6: Distribution of transmission workload (PAST)

Figure 7: Currency vs. completed query percentage (PAST-WS)

Figure 8: Currency vs. completed query percentage (PAST)
Abstract. As the distribution environment of digital content is rapidly
changing, the protection of digital rights for digital content has been
recognized as a very critical issue that needs to be dealt with effectively.
Digital Rights Management (DRM) has attracted much interest from Internet
Service Providers (ISPs), authors and publishers of digital content
in order to create a trusted environment for access and use of digital
resources. In this paper, PKI (Public Key Infrastructure) and a licensing
agent are used in order to prevent illegal use of digital content by
unauthorized users. In addition, a DRM system is proposed and designed
which performs proprietary encryption and real-time decoding using the
I-frame-under-container method to protect the copyright of video data.



1 Introduction 

As the distribution environment for digital resources undergoes rapid changes
resulting from the proliferation of the Internet and increased interconnection
among computers, the demand for multimedia content such as music, images,
movies and publications in digital form is rapidly increasing. Since such
digital content can be duplicated without deterioration in quality, the
protection of digital copyrights against unauthorized duplication is emerging
as an important issue. Content protection and management require information
protection technology that provides stability and security, as well as Digital
Rights Management (DRM) technology for managing copyrights and for monitoring
and tracing the overall distribution of content [1,2]. DRM can be defined as a
management technology which continuously protects and manages the rights and
interests of copyright holders [3,4]. Comprehensive measures for protecting
digital content against copyright infringement are being pursued using DRM
technology, and various studies are being carried out to create a trusted
environment for the creation, distribution and use of copyrighted media [5,6].
Several companies, such as InterTrust and ContentGuard, offer various types of
DRM solutions. However, in existing DRM technology, static copyright management
is performed by inserting protection conditions, authoring management, etc.,
into the content; therefore, because of limitations in monitoring and tracing
functionality, not only is dynamic control of copyright difficult to achieve,
but it is also difficult to obtain proof of illegal conduct should copyright
infringement (such as illegal copying) occur. As such, a digital copyright
management technology which is applicable to all types of online and offline
content and enables dynamic copyright management, as well as real-time
monitoring and tracing, must be developed [7].

In this paper, an integrated DRM system is proposed and designed which
provides user certification for multimedia content in online and offline
conditions by using PKI and a licensing agent, and prevents illegal use by
unauthorized users through encryption of the data itself.

2 Related Works 

Existing DRM technology does not take privacy protection into consideration,
since the protection of users’ privacy is not directly necessary for copyright
protection. Because of this, user information is leaked during user
certification for issuing licenses and when usage details are reported for
monitoring illegal usage of content, which has led to user privacy
infringements [8]. Microsoft’s WMRM (Windows Media Rights Manager) is an
end-to-end DRM system which distributes digital media files to content
providers and consumers in a secure manner [9,10]. WMRM distributes media, such
as music or video, to content providers through the Internet in a protected
form by encrypting the files. In WMRM, each server or client instance receives
a key pair through the individualization process; instances which are
considered cracked or unsafe are excluded from service through the certificate
cancellation list. WMRM is widely used in a form incorporated into Windows
Media Player; however, its flexibility in dynamic environments is limited: it
supports only Windows Media Player and therefore only a limited number of file
formats. In addition, one disadvantage of WMRM is that, since its user
certification process for issuing licenses does not use any specific protection
mechanism, user information such as user IDs or e-mail addresses is leaked.

3 System Architecture 

Data protection of original content and authentication should not be
implemented by simple access right control on existing content or by
password-based authentication, but by user authentication using PKI and by
inserting related information into the original content using data encryption.
The proposed system has a client/server architecture, and its overall layout is
illustrated in Fig. 1.

When content is registered on the server using an external interface,
processing for content monitoring is performed by an agent module and the
content is encrypted. In order to use the content, user authentication
is performed by the licensing agent sent from the server: for authorized users,
the content is executed by the application program, and for unauthorized users,
a warning message is output. Real-time monitoring against illegal activities is
performed on the content by the licensing agent, and all illegal activities of all
users are stored in the server’s database through a monitoring interface.

Fig. 1. System Architecture



4 Authentication and Encipherment Mechanism 

4.1 Encryption and Decryption of Video Data 

When the server receives a request for use of content, it performs user
authentication. The symmetric key needed for decoding is then encrypted using
the user’s public key and sent to the client, so that the encrypted I frames of
the video data can be decrypted at the client for playback. Figure 2
illustrates the encryption and decryption process of the video data, which is
the content used in this system.

The video data stored on the server is generated by extracting the I frames of
each video content and then encrypting them using a symmetric key. Any user
can download the server’s video data; however, the data cannot be used
without proper authorization since the I frames are encrypted. A symmetric key
algorithm is used since it minimizes the time required for encryption and
decryption. If the downloaded video data is to be played back at the client,
the user issues a use request to the server, which then performs authentication
to determine whether the user is authorized. In this authentication process, a
PKI algorithm is used. The symmetric key of the requested video data is
encrypted using the user’s public key and then sent to the client.

The client’s agent decrypts the symmetric keys using the user’s private key,
then recovers the I frames of the video data to be played back by using these
symmetric keys, and stores them in the buffer together with the B and P frames
to perform playback. While the video data is being played back, the delayed
frames should be calculated to determine the initial buffer size.

Fig. 2. Encryption & Decryption Processing
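
The idea of encrypting only the I frames while leaving the P and B frames in the clear can be illustrated with the Python sketch below. It is a toy under explicit assumptions, not the proposed system: the frame representation, the function names, and the SHA-256-based XOR keystream are hypothetical stand-ins for the unspecified symmetric cipher, and this keystream is for illustration only and should not be used for real protection. The wrapping of the symmetric key with the user's public key is omitted.

```python
import hashlib


def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR data with SHA-256(key || nonce || offset) blocks."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + nonce + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, pad))
    return bytes(out)


def protect_stream(frames, key: bytes):
    """frames: list of (frame_type, payload). Only "I" payloads are transformed."""
    protected = []
    for index, (frame_type, payload) in enumerate(frames):
        if frame_type == "I":
            payload = keystream_xor(key, index.to_bytes(8, "big"), payload)
        protected.append((frame_type, payload))
    return protected


# XOR with the same keystream is its own inverse, so decryption reuses the function.
unprotect_stream = protect_stream


if __name__ == "__main__":
    video = [("I", b"intra-coded picture"), ("P", b"predicted"), ("B", b"bidirectional")]
    stored = protect_stream(video, b"0123456789abcdef")
    print(stored[0][1] != video[0][1], unprotect_stream(stored, b"0123456789abcdef") == video)
```

Because P and B frames depend on I frames for decoding, leaving them unencrypted keeps the encryption work small while the stream remains unusable without the key.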



4.2 License Certification Method 

The author of the content sends the created content to a content publisher. Then,
the content publisher encrypts (E) the content using an arbitrary symmetric
key Ks to generate the encrypted content C, which is then sent to the content
provider to be stored on the content provider’s server.

C = E_Ks[data]

The user can download any desired content from the content provider’s server.
However, the user cannot execute the downloaded content arbitrarily because it
is encrypted.



Step 1. User Registration Protocol 

The user has to register first in order to use the content. The user registration 
process is shown in Fig. 3. 

The user connects to a system server which functions as a license server and sends
the user’s certificate cert_u. The system server verifies the user’s certificate cert_u
through its specified certification path; if the certificate is valid, it sends the
license agent to the user.

Fig. 3. User Registration Protocol



Step 2. License Issuing Protocol 

The user installs a license agent (LA) program which is then executed. When 
the user executes an encrypted content, the license agent installed on the user’s 
PC connects to the system server and obtains a license, as shown in Fig. 4. 




[Figure: the License Agent (client) sends a License Request and receives
E_kuu[E_kpL/C(license)] in return.]

Fig. 4. License Issuance Protocol



The license agent connects to the system server and requests a license for the
desired content. The system server issues a license that includes the license ID,
user ID, content ID and privileges. For security reasons, the license is signed
using the private key of the license clearing house and then encrypted using the
user’s public key (as shown below) before it is transmitted.

E_kuu[E_kpL/C(license)]

Here, ku denotes a public key and kp a private key. Therefore, kuu is the
user’s public key and kpL/C is the private key of the license clearing house
(L/C).

Step 3. License Certification Protocol 

When the user executes encrypted content, the license agent checks whether
a license is present. If there is no license, a license is requested
according to Step 2 above; if there is a license, authentication of
that license is requested from the license clearing house, which resides on the
system server, as illustrated in Fig. 5 below.

[Figure: the License Agent (client) sends its encrypted license and receives
E_kuu[key'] in return.]

Fig. 5. License Certification Protocol



When the license clearing house receives an authentication request for the
license from the license agent, it checks the privileges in the license information
list and then performs authentication. If the user’s license is valid up to a
specific date, it checks whether that date has passed; if the license allows a
fixed number of uses, the remaining number is decremented by 1. Then, after the
key value is combined with the user’s ID (as shown below), the result is
encrypted using the user’s public key and sent to the user.

E_kuu[key'] (where key' = ks ⊕ user_ID)

The user agent, which has received the encrypted key, decrypts it using the
user’s private key to extract key'. It then combines key' with the user_ID in
the user’s license to recover the key ks needed for decrypting the encrypted
content, which is then shown to the user.
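
The license check and the user-ID-dependent key masking described above can be sketched in Python as follows. The sketch assumes that the operation combining ks with the user ID is a bytewise XOR and that a shorter user ID is repeated to the key length; the dictionary layout of a license and the function names are hypothetical. Public-key encryption of key' is left out.

```python
from datetime import date


def xor_mask(key: bytes, user_id: bytes) -> bytes:
    """Return key XOR user_id, repeating user_id to the key length (assumption)."""
    if not user_id:
        raise ValueError("user_id must not be empty")
    return bytes(k ^ user_id[i % len(user_id)] for i, k in enumerate(key))


def authorize(license_info: dict, today: date) -> bool:
    """Check a date-limited or count-limited license; decrement the use counter."""
    if "expires" in license_info:
        return today <= license_info["expires"]
    if license_info.get("uses_left", 0) > 0:
        license_info["uses_left"] -= 1
        return True
    return False


if __name__ == "__main__":
    ks = bytes(range(16))                      # content key held by the clearing house
    masked = xor_mask(ks, b"alice")            # key' sent (encrypted) to the user
    print(xor_mask(masked, b"alice") == ks)    # the agent recovers ks with the same ID
    print(authorize({"uses_left": 1}, date(2004, 10, 18)))
```

Because XOR is its own inverse, the same function masks the key at the clearing house and unmasks it in the user agent, and a different user ID yields a different, unusable key.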

5 Performance Evaluation 

To evaluate the performance of the system proposed in this paper, we
implemented a prototype version of the system using Visual C++ 6.0 and MS-SQL
2000. The encryption time for the video data and the initial playback delay
due to decoding were both measured on a PC with an Intel(R) Pentium 4 2.4 GHz
CPU and 512 MB of RAM. Fig. 6 compares the conventional method of playing back
an encrypted video data file after decoding it completely (non-real-time
decoding) with the method of playing back while decoding in real time, which is
proposed in this paper. To measure time accurately, a total of 30 video data
files were segmented in minute units for the comparison. As the test result
shows, in the conventional method of playback after decoding the entire video
data file, the larger the file size, the longer the initial delay before
playback, whereas in the proposed method the initial delay is reduced
significantly.

The result shows that, in the proposed method, the delay until the start
of video data playback (including decoding time) is much shorter than that
of the conventional method. In addition, even with real-time decoding, stable
playback was demonstrated without interruption of playback or noise.

Fig. 6. Encryption Time and Delay Time Comparison



6 Conclusion 

In this paper, a DRM system for digital copyright protection using a licensing 
agent which is based on PKI has been proposed and designed. The licensing 
agent performs user authentication using PKI methodology, performs encryption 
of the data itself at the system server using container methodology, and decoding 
is performed in real time for the client under copyright protection of multimedia 
data. As a follow-up, complete implementation of the proposed system and a 
safety evaluation of the user authentication process are necessary. 
Abstract. This paper proposes a handoff management scheme for the
synchronization algorithm of an all-IP based multimedia system. The synchronization
algorithm and handoff management method are proposed to realize smooth
play-out of a multimedia stream with minimum loss under handoff conditions,
which normally occur due to the movement of mobile hosts. Handoffs, which
occur frequently in mobile environments, result in the loss of multimedia
streams stored in base stations because of the change of base station. As a result,
the multimedia stream shows a low QoS due to disruption of the stream at play-out.
The proposed scheme not only provides continuous play-out of the stream but also
achieves a higher packet play-out rate and a lower loss rate than
previous methods.



1. Introduction 

The recent explosive increase in Internet utilization has accelerated the emergence of 
new services based on the mobile computing environment. As a result, the existing 
service architecture, which is based on simple communication between the client and 
the server, is no longer capable of satisfying the needs of users who want to utilize 
various types of high-speed multimedia services [1,2]. This limitation of existing 
communication service architecture can be overcome by expanding the concept of 
service provision from one that is single system based to a mobile system connected 
by a wireless network: the mobile environment provides key functionalities which 
enable such performance expansion. In a mobile environment, play-out of multimedia
data on a mobile host is difficult due to inherent characteristics of wireless networks
such as the high data loss rate, long delays and low network bandwidth. The
distributed multimedia systems which are connected to the wireless network in large
numbers use buffers to overcome network delays and unpredictable losses.

A Base Station (BS) transmits MPEG Groups of Pictures (GoPs) received from multiple
multimedia servers. However, if the expected play-out time is earlier than the arrival
time, unpredicted delays and increases in traffic can cause failures in the play-out of
subframes. To solve this problem, buffering is needed at the BS in order to reduce
interpacket jitter delay between the multimedia server and the BS in a mobile
environment. A mobile network offers the advantage that Mobile Hosts (MHs) can move
within the network [3,4,6]. However, the disadvantage is that multimedia streams
transmitted to a BS must be retransmitted. The disadvantage of handoff is that the
MPEG video data stored in a BS buffer can be lost. In this paper, the BS and the MH
are configured with a 2-jitter buffer and a 1-jitter buffer, respectively. In addition,
for the GOP (I, P, B) of the MPEG video generated and lost during handoff, the
I-picture of the MPEG video at the old BS is transmitted to the new BS. The proposed
method aims to enable play-out of the substream stored in the buffer of the new BS
without retransmission, so that adverse effects on play-out resulting from loss of
media are prevented.



2. Related Work

Current research has reached a level at which synchronization schemes based on
wireless communications are incorporated into conventional ones. D. K. Y. Yau and
S. S. Lam adopted a frame rate adjustment scheme in which the CPU processing time
of a video server is adjusted according to the frame transmission rate, and the traffic
load on network links is monitored using kernel-level threads. This leads to a
reduction in network traffic by increasing or decreasing frame transmission rates.
However, one drawback of this approach is that control is performed only on the
server side [4].

M. Woo, N. U. Qazi, and A. Ghafoor defined a BS as an interface between
wired and wireless networks. For wired networks, the interface defines buffering in the
BS to reduce jitter delay between packets. One possible shortcoming of this approach
is that it attempts to apply buffering to synchronization by assigning existing
wireless communication channels [5].

Azzedine Boukerche proposed an efficient distributed synchronization scheme for
wireless and mobile multimedia systems to ensure and facilitate mobile client access
to multimedia objects [1]. He also proposed synchronization and handoff management
schemes that allow mobile hosts to receive time-dependent multimedia streams
without delivery interruption while moving from one cell to another [2]. In this
method, however, the delay and overhead due to forwarding MMU data from the buffer
of the old BS to the new BS adversely affect the play-out of MMUs in the MH.



3. Multimedia Synchronization for Handoff in MPEG GOP

3.1 System Architecture 

This system supports k multimedia server nodes, m BSs, and n MHs. The BS
communicates with the ith MH in the mth cell. These MHs access the server via the BS.
This system monitors variations in the start time for transmission as well as the
buffers at the BS, using variables such as the arrival time of subframes transmitted
from multimedia servers and the delay jitter. Some advantages of this system include
its capability to overcome the limitations of mobile communications, such as small
memory size and low bandwidth. The MPEG video data stored in the multimedia server
are composed of data streams split into GoP units in sequence, based on a
synchronization group, not on the same byte. Figure 1 shows the configuration of the
proposed system. Handoff refers to a state in which an MH moves out of range of one BS
and comes into range of another. If handoff takes place during MPEG video
transmission, data being transmitted from the BS buffer are lost. This system has an
architecture which passes only the I-frames of the MPEG video into the buffer of the
new BS for transmission and play-out. Figure 1 also gives an overview of the system
modules for synchronization of MPEG videos in mobile environments. MPEG video streams
are retrieved as frames of a certain type stored in the database of the multimedia
server, and the sender transmits them to the MH via the BS. The MH receives the bit
streams as frames of a certain type from the multimedia server. The frames stored in
the buffer are decoded by the MPEG decoder and are then played out.



Fig. 1. Synchronization System for MPEG Video (panels: Land-Based Network, Mobile Network)



3.2 Control of Multimedia Server’s Transfer Rate 

There are three types of MPEG frames: I, P and B. During the decoding process, I
and P frames are considered more important than B frames. If the GoP loss rate
increases due to overload, the server’s transfer rate can be reduced by selectively
discarding B frames, which reduces the loss rate of I and P frames and so increases
their reception rate. The picture quality is therefore better than when no transmission
rate control is applied at the server. If, at time t, the remaining time is
insufficient, the GoP is discarded; that is, if the number of possible transmissions
r_t is 0, the GoP is discarded. Here, the server’s transmission rate is controlled by
selectively discarding GoPs according to the loss rate of the transmitted GoPs. Taking
this into consideration, a method for controlling the server’s transfer rate according
to the GoP loss rate within the network is proposed in Fig. 2.
Procedure Control of Server’s Transmission Rate

While (wait until message of [...] is received from BS)
    Wait                                    // decrement
else if ([...])
    Transmit/maintain GoP(I,P,B)
else
    While (wait until message of [...] is received from BS)
        transmit GoP(I)                     // increment

Fig. 2. Control of server’s transmission rate according to GoP loss rate
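
The selective discarding of B frames under load can be illustrated with the short Python sketch below. It is only a sketch of the idea stated in the text: the single loss-rate threshold, its value, and the function names are assumptions made for illustration, since the thresholds of the original procedure are not recoverable here.

```python
def frame_types_to_send(loss_rate, threshold=0.10):
    """Send the full GoP at low loss rates; drop B frames once loss exceeds threshold.

    The threshold value is hypothetical and would be tuned per network.
    """
    if loss_rate <= threshold:
        return ("I", "P", "B")
    return ("I", "P")


def filter_gop(gop, loss_rate, threshold=0.10):
    """gop is a sequence of frame-type strings such as ["I", "B", "B", "P"]."""
    allowed = frame_types_to_send(loss_rate, threshold)
    return [frame for frame in gop if frame in allowed]


if __name__ == "__main__":
    gop = ["I", "B", "B", "P", "B", "B", "P"]
    print(filter_gop(gop, loss_rate=0.02))
    print(filter_gop(gop, loss_rate=0.30))
```

Dropping B frames first is the natural choice because no other frame depends on them for decoding, so the stream degrades in frame rate rather than breaking.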



3.3 Controlling the Play-Out Time of the MH 

In the method proposed by Chen [4], the server’s transmission time is reconfigured
using feedback information from the MH; here, a new transmission time is set. However,
this involves a complicated method of setting the new transmission time, and it is not
suitable for a mobile environment where there is no guaranteed limit on loss and
delay. In selecting a control for delay variation in such a network environment,
resetting the estimated play-out time of the MH is more desirable than controlling the
server’s transmission time. The proposed algorithm for controlling the estimated
play-out time according to the delay variation of the synchronization method within
the media is illustrated in Fig. 3. After the GoP is received, packets are separated
and a two-stage synchronization process within the media is applied. In the first stage,

Procedure Controlling the estimated play-out time
While (not EOF) {
    Get a new GoP;
    D = SCR - STC;
    if (D > T_p) { Insert into decoding buffer; }
    else if (D > 0) { Insert into delay buffer; }
    else { Discard; }
    if ([...] < D_avg < [...]) {            // D_avg is the average delay time
        continue; }
    else if (D_avg <= [...]) {
        P_t = P_t - S;                      // P_t is the interval time, S is the max jitter time
        continue; }
    else {
        P_t = P_t + S;
        continue; }
}

Fig. 3. Controlling the estimated play-out time
the amount of GoP delay occurring within the mobile network is calculated. Since the
server transmits packets based on the SCR, the received SCR is the time of the server’s
packet transmission, and the STC of the MH at the reception point is the packet’s
arrival time. Therefore, the delay can be calculated as the difference between the
received packet’s SCR value and the STC value of the MH at arrival time. The second
stage is the process of reconfiguring the estimated play-out time of the MH according
to the network delay: it consists of an estimated play-out time delay action and an
estimated play-out time progression action. The average delay D_avg is calculated by
passing the delay D experienced by each packet through a low-pass filter.
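
The delay measurement and its smoothing can be sketched in Python as follows. The text does not specify the filter, so this sketch assumes a first-order low-pass filter (an exponentially weighted moving average); the class name, the `alpha` coefficient and its default value are hypothetical, and the delay is taken as the STC at arrival minus the received SCR.

```python
class DelayEstimator:
    """Tracks the smoothed network delay of received packets (times in ms)."""

    def __init__(self, alpha=0.875):
        # alpha weights the previous average; 1 - alpha weights the new sample.
        self.alpha = alpha
        self.average = None

    def update(self, scr, stc_at_arrival):
        """Feed one packet's send time (SCR) and arrival time (STC); return the average."""
        delay = stc_at_arrival - scr
        if self.average is None:
            self.average = float(delay)       # first sample initializes the filter
        else:
            self.average = self.alpha * self.average + (1.0 - self.alpha) * delay
        return self.average


if __name__ == "__main__":
    estimator = DelayEstimator(alpha=0.5)
    for scr, stc in [(0, 100), (0, 200), (10, 60)]:
        print(estimator.update(scr, stc))
```

A larger `alpha` makes the estimate react more slowly to jitter, which keeps the play-out time from being adjusted on every delayed packet.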



3.4 Handoff Control 

This paper is intended to enable an MH to play out media streams
within QoS limits, without any loss of multimedia data or additional delays
accompanying message transmission during a handoff. In this paper, MH_i receives
multimedia service via the BS. The BSs are classified into a primary BS and
non-primary BSs. The former is the BS responsible for sending MPEG video data in GOP
units to the MH, while the latter are BSs adjoining the primary BS.

The buffer within the BS has a size of two jitters, while the one within the MH has a
size of one jitter. Such buffer configurations offer the advantage of preventing loss
of MPEG video during a handoff, unlike the configuration scheme where the buffer
works only within the BS. This configuration strategy also compensates for the MH’s
small memory size. Handoff occurs when the MH travels out of the range of BS_current
and into the range of BS_new. The BS reports a handoff message to the
multimedia server and starts handoff processing. The algorithm is described as
follows.

Algorithm Handoff Control

1. The MH performs play-out for a time equal to (GoP x /) x A regardless of handoff.
Here, / is the play-out speed time required by the substream and A is the maximum
jitter time within the media.

2. The MH sends a Handoff message, indicating that handoff has occurred, to the
multimedia server and BS_current.

3. Setting performed: BS_old = BS_current and BS_current = BS_new.

4. The BS sends a Handoff message to the multimedia server, notifying the server to
stop transmission of GOPs.

5. Each multimedia server transmits GOP(I) to the BS as soon as it receives the
Handoff message.

6. Only the GOP(I) of the MPEG video that exists in the buffer of the old BS is
forwarded to the buffer of the new BS.

7. After it has received GOP(I) from the buffer of the old BS through the server, it
sends a Handoff message to the multimedia server and the mobile host to notify them
that handoff has been completed successfully.

8. sends request messages to each server to perform a normal transfer. 
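
The forwarding rule in steps 6 and 7 can be illustrated with a small sketch. It
assumes that each buffered entry is a (GOP index, frame type) pair and that only
I-frames are carried over to the new BS; the class and function names are
hypothetical and are not part of the original algorithm description.

```python
from collections import deque


class BaseStation:
    """Hypothetical base station holding a buffer of (gop_index, frame_type) entries."""

    def __init__(self, name):
        self.name = name
        self.buffer = deque()


def forward_on_handoff(old_bs, new_bs):
    """Forward only the buffered I-frames from the old BS to the new BS.

    P- and B-frames left in the old buffer are dropped, which keeps the
    amount of data moved during handoff small.
    """
    forwarded = [frame for frame in old_bs.buffer if frame[1] == "I"]
    new_bs.buffer.extend(forwarded)
    old_bs.buffer.clear()
    return forwarded


if __name__ == "__main__":
    old_bs, new_bs = BaseStation("old"), BaseStation("new")
    old_bs.buffer.extend([(1, "I"), (1, "P"), (1, "B"), (2, "I")])
    print(forward_on_handoff(old_bs, new_bs))
```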

4 Performance Evaluation

This section presents the simulation experiments we carried out to evaluate the
performance of our synchronization scheme. We developed a distributed discrete-event
model to simulate a cellular wireless multimedia system. Simulation was performed
using IBM-compatible PCs with a Pentium or equivalent processor. Interfaces and
algorithms were written using the Java Development Kit (JDK) 1.3, and the simulation
data were stored in Microsoft MDB format as simulation.mdb files. For the purpose of
this paper, we have assumed that all simulations are performed in mobile
environments. In order to ensure proper packet processing, we applied the information
used for actual simulations, computed using the Poisson distribution, equally to
mobile networks which have 300 channels for 60 cells. One thousand GOP frames were
used in the performance evaluation experiments, to which a maximum delay jitter time
of 600 ms was applied. Table 1 displays the simulation parameters we used in our
experiments.



Table 1. Simulation parameters

Parameter                       Value
Number of cells                 60
Number of servers               4
Buffer size of a MH             1 jitter
Buffer size of a BS             2 jitters
Play-out times/GOP              250 ms
Forward time to BS              100 ms
Rate of handoff                 5% of MMUs
RTT to request/deliver a GOP    100 ms
Average jitter                  200 ms
Maximum jitter                  600 ms



Figure 5 shows the GoP loss rate at the MH. Here we can see that, when control is
performed, the number of transmitted GoPs is reduced as the loss rate increases,
which in turn reduces the loss rate.

The GoPs from 15 to 20 seconds are subject to handoff; therefore, the buffer data of
the old BS is sent to the buffer of the new BS in order to prevent the loss of the
MPEG video's I-frame as a result of handoff. In addition, a method has been proposed
for obtaining more time to move data from the old BS to the buffer of the new BS
during handoff, by adding a delay of 10 ms of maximum delay jitter X to the I-picture
interval time T of the GOP at the MH.

As shown in Figure 5, previous research results failed to deal with handoff and as
such suffer from loss of substream data in the buffer of the old BS.



Fig. 5. Comparisons of GoP loss rate in MH 




Fig. 6. Comparisons of play-out rate 



Figure 6 shows the results of 31 tests in which the arrival time was changed for
each experiment. In Tests 2 and 8, the minimum delay and maximum delay were set to
50 ms and 600 ms, respectively; compared with the no-control case, the play-out rate
was improved by 6% for the Chen scheme, by 8% for the Azzedin-Boukerche scheme and
by about 10% for the scheme proposed in this paper.

In Tests 4, 9 and 26, the minimum delay and maximum delay were set to 20 ms and
800 ms, respectively, to induce overflow and starvation. In these cases, although
network traffic conditions worsened, the play-out rate was improved by 5% for the
first variance and by 8-9% for the second variance. On average, the play-out rate was
about 79% for the conventional method, 85% for the first variance and 91% for the
second variance.



5 Conclusions

We proposed a scheme that uses the MPEG frames within a GOP in mobile networks. The
delay and overhead due to forwarding MMU data from the buffer of an old BS to a new
BS adversely affect the play-out of MMUs at the MH. To solve this problem, we
proposed an algorithm which forwards only the I-frame of the MPEG frames to the new
BS. The MPEG frames within the buffer of the MH are presented by adding the maximum
delay jitter to the interval time of the GOP, because the play-out policy is related
to the occupancy rate of the buffer during handoff. Therefore, we proposed an
algorithm that performs an overflow policy and an underflow policy based on buffer
watermarking. Moreover, we proposed a scheme that not only deals with handoff quickly
by controlling the buffer and play-out policy, but also properly handles limiting
factors for mobile communications such as small memory size and low bandwidth. The
proposed scheme allows the BSs to control handoff and the play-out policy, and
further provides a solution to the problems of handoff in mobile networks. The
evaluation shows that the proposed scheme offers continuous MPEG data play-out,
higher packet play-out rates, and lower packet loss rates relative to the
conventional scheme.
Abstract. In this study, we propose a novel mobile tracking scheme that uses
fuzzy-based decision making and considers information such as the previous
location, the moving direction and the distance to the base station, as well as
the received signal strength, thereby achieving much better estimation
performance than previous schemes. Our scheme divides a cell into many blocks
based on the signal strength and then estimates stepwise the optimal block in
which a mobile is located using Multi-Criteria Decision Making (MCDM). Through
numerical results, we show that our proposed mobile tracking method performs
better than the conventional method using the received signal strength.



1 Introduction 

In a microcell-based or picocell-based mobile communication network, frequent
movements of a mobile terminal or host bring excessive traffic into the network and
may severely degrade the quality of service (QoS). If the location of the mobile can
be estimated, network resources may be allocated more effectively and better QoS can
be provisioned in combination with handoff optimization. Moreover, location
estimation technology may be used in other new applications such as emergency calls
for disaster recovery. It will play a viable role in the communication networks of
the next generation. The Global Positioning System (GPS) was initially developed for
military purposes, but it is also used for civil applications such as local traffic
information services and geo-location based applications. However, incorporating GPS
receivers into handsets raises questions of cost, size and power consumption [1].

Other methods for location estimation are based on radio signal propagation, such as
signposts, dead reckoning, and circular or hyperbolic trilateration systems. Many
methods and systems have been proposed based on measurement of the radio signal
strength of a mobile object's transmitter by a set of base stations [2, 3, 4].
Recently, adaptive schemes based on the use of cellular systems and on fuzzy logic
[5], hidden Markov models [6, 7] and pattern recognition methods [8] have been used
to estimate the position of mobiles. The system studied in [2] estimates the mobile
location using information on contours, but it does not provide a realistic search
procedure. In [3] the estimation is based on the signal strength received at a
multi-beam antenna of a base station in a multi-path environment, and on the angle of
arrival (AOA) of the signal. AOA is measured under the assumption that the signal is
in line of sight (LOS), but a LOS signal may not be received in a microcell where
reflections and diffractions occur due to a dense building environment. In this
situation the AOA of the strongest reflected signal is used for estimation, and
therefore the estimated location differs greatly from the real one. The time of
arrival (TOA) of a signal from a mobile to neighboring base stations is used in [9],
but this scheme has two problems. First, accurate synchronization is essential
between all sending endpoints and all receiving ones in the system. An error of 1 μs
in synchronization results in a 300 m error in location. Second, this scheme is not
suitable for the microcellular environment because it also assumes a LOS environment.
The time difference of arrival (TDOA) of signals from two base stations is considered
in [10]. The TOA and TDOA schemes have been studied for IS-95B, where the PN code of
the CDMA system can be used for location estimation. Enhanced Observed Time
Difference (E-OTD) is a TDOA positioning method based on the OTD feature already
existing in GSM. The mobile measures the arrival time of signals from three or more
cell sites in a network. In this method the position of the mobile is determined by
trilateration [11]. E-OTD, which relies upon the visibility of at least three cell
sites to calculate the position, is not a good solution for rural areas where
cell-site separation is large. However, it promises to work well in areas of high
cell-site density and indoors.

The above-mentioned schemes such as AOA, TOA and TDOA have the following problems.

• These schemes assume that the cellular system consists of LOS areas. They get
good results only under this assumption.

• A microcellular system such as IMT-2000 has NLOS areas which are affected by
specific reflections and diffractions. In this situation these schemes have large
estimation errors.

• In the microcellular environment the points of the same average signal strength
form not a circular contour but a distorted one. These schemes ignore the fact that
the propagation rule is affected by many parameters.

• They rely only on information related to the radio signal, such as signal strength.
Their accuracy is affected by short-term fading, shadowing or diffraction.

In this study, to enhance estimation accuracy, we propose a scheme based on
Multi-Criteria Decision Making (MCDM) which considers multiple parameters: the signal
strength, the distance between the base station and the mobile, the moving direction,
and the previous location. This process is based on a three-step location estimation
which determines the mobile position by gradually reducing the area in which the
mobile may be located [12]. Using MCDM, the estimator first estimates the sector in
the sector estimation step, then estimates the zone in the zone estimation step, and
finally estimates the block in the block estimation step.



2 Estimation Procedure 

Figure 1 shows how our scheme divides a cell into many blocks based on the signal 
strength and then estimates the optimal block stepwise where the mobile is located 
using MCDM. 
Fig. 1. Sector, Zone and Block



The location of a mobile within a cell can be defined by dividing each cell into sectors, 
zones and blocks and relating these to the signal level received by it at that point. It is 
done automatically in three phases of sector definition, zone definition and block 
definition. Then the location definition block is constructed with these results. They 
are performed at the system initialization before executing the location estimation. 
The sector definition phase divides a cell into sectors, and assigns a sector number to 
blocks belonging to each sector. The zone definition phase divides each sector into 
zones, and assigns a zone number to blocks belonging to each zone. The block 
definition phase assigns a block number to each block. In order to indicate the
location of each block within a cell, a 2-dimensional vector (d, a) is assigned to
each block. After the completion of this phase each block has a set of block
information.

The collection of block information is called the block object. The block object 
contains the following information: the sector number, the zone number, the block 
number, the vector data (d, a), the maximum and the minimum value of average PSS 
for the LOS block, the compensated value for the NLOS block and a bit for indicating 
“node” or “edge”, etc. 

Using MCDM and the block object which is constructed as described above, the 
estimator is started with a timer, and the estimation is performed sequentially in three 
steps: sector estimation, zone estimation, and finally block estimation. 



3 Mobile Tracking Based on MCDM 

3.1 Multi-Criteria Decision Parameters 

In our study, the received signal strength, the distance between the mobile and the
base station, the previous location, and the moving direction are considered as
decision parameters. The received signal strength has been used in many schemes, but
it has very irregular profiles due to the effects of the radio environment. The
distance is considered because it can explain the block allocation plan; however, it
may also be inaccurate due to the effect of multi-path fading, etc., and it is not
sufficient by itself. We also consider the previous location. It is normally expected
that the estimated location should be near the previous one. Therefore, if the
estimated location is too far from the previous one, the estimation may be regarded
as inaccurate. We also consider the moving direction. Usually the mobile is most
likely to move forward, less likely to move rightward or leftward, and least likely
to move backward more than one block. A low-speed mobile (a pedestrian) has a smaller
moving radius and a more complex moving pattern, while a high-speed mobile (a motor
vehicle) has a larger radius and a simpler pattern.

In mobile tracking using MCDM, the decision function D is defined by combining the
degrees of satisfaction for multiple evaluation parameters, and the decision is made
on the basis of this function. Each evaluation parameter can be seen as a
proposition. A compound proposition is formed from multiple evaluation parameters
with a connective operator, and the total evaluation is performed by totaling the
values for the multiple parameters with connective operators. In this method errors
in the evaluation parameters cause milder changes in the total evaluation value than
in binary logic.



3.2 Membership Function 

The membership function with a trapezoidal shape is used for determining the 
membership degree of the mobile because it provides a more versatile degree between 
the upper limit and the lower one than the membership function with a step-like shape. 
Let us define the membership functions for the pilot signal strengths from 
neighboring base stations. 

The membership function of $PSS_i$, $\mu_R(PSS_i)$, is given by Figure 2. $PSS_i$ is
the signal strength received from base station i, $s_1$ is the lower limit, and $s_2$
is the upper limit.
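
As a minimal sketch of a trapezoidal membership function, the following code maps a
measured value to a degree in [0, 1]. The four breakpoints a, b, c and d and the dBm
values in the example are illustrative assumptions; the exact shapes of the
membership functions in Figures 2 and 3 are not reproduced here.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership degree for breakpoints a <= b <= c <= d.

    The degree rises linearly from a to b, is 1 between b and c,
    and falls linearly from c to d.
    """
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


if __name__ == "__main__":
    # Hypothetical pilot signal strengths in dBm.
    for pss in (-110, -90, -70, -50):
        print(pss, trapezoid(pss, -100, -80, -60, -40))
```

Unlike a step function, the sloped edges let a measurement near a limit contribute
a partial degree of membership instead of flipping between 0 and 1.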




Fig. 2. The membership function of the PSS.    Fig. 3. The membership function of the distance.

Now we define the membership function of the distance. The membership function of
the distance, $\mu_R(D_i)$, is given by Figure 3, where $D_i$ is the distance between
base station i and the mobile, $d_1$ is the upper limit, and $d_2$ is the lower limit.
The membership function of the previous location of the mobile, $\mu_R(L_i)$, is
given by Figure 4, where $L_i$ is the vector information of its current location,
$L_1, \ldots, L_4$ is the vector information of the previous location, and $g_i$ is
the physical difference between them.

Fig. 4. The membership function of the location.    Fig. 5. The membership function of the direction.

The membership function of the moving direction, $\mu_R(C_i)$, is given by Figure 5.
$C_i$ is the vector information of the moving direction, $PSS_1, \ldots, PSS_4$ is
the pilot signal strength and $o_i$ is the physical difference between the previous
location and the current one.



3.3 Location Estimation 

Most MCDM approaches tackle the decision problem in two consecutive steps:
aggregating all the judgments with respect to all the criteria for each decision
alternative, and ranking the alternatives according to the aggregated criterion. Our
approach also uses this two-step decomposition [13].

Let $P_i$ ($i \in \{1, 2, \ldots, n\}$) be a finite number of alternatives to be
evaluated against a set of criteria $K_j$ ($j = 1, 2, \ldots, m$). Subjective
assessments are to be given to determine (a) the degree to which each alternative
satisfies each criterion, represented as a fuzzy matrix referred to as the decision
matrix, and (b) how important each criterion is for the problem evaluated,
represented as a fuzzy vector referred to as the weighting vector.

Each decision problem involves n alternatives and m linguistic attributes
corresponding to m criteria. Thus, decision data can be organized in an m x n matrix.
The decision matrix for alternatives is given by Eq. (1):







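$$
\begin{bmatrix}
\mu_R(PSS_{11}) & \mu_R(D_{12}) & \mu_R(L_{13}) & \mu_R(C_{14}) \\
\mu_R(PSS_{21}) & \mu_R(D_{22}) & \mu_R(L_{23}) & \mu_R(C_{24}) \\
\vdots & \vdots & \vdots & \vdots \\
\mu_R(PSS_{n1}) & \mu_R(D_{n2}) & \mu_R(L_{n3}) & \mu_R(C_{n4})
\end{bmatrix}
\qquad (1)
$$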
The weighting vector for the evaluation criteria can be given by using linguistic
terminology with fuzzy set theory [14]. It is a finite set of ordered symbols
representing the weights of the criteria using the following linear ordering: very
high > high > medium > low > very low. The weighting vector W is represented as
Eq. (2).

$$
W = (w_i^{PSS}, w_i^{D}, w_i^{L}, w_i^{C}) \qquad (2)
$$

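The fuzzification procedure leads to a performance matrix $\mu \in [0,1]$ where each
element expresses how much the n-th alternative satisfies the m-th criterion.
Therefore, each row of the performance matrix is a fuzzy set $\mu_m$ expressing the
satisfaction of the m-th criterion in the universe of the available alternatives
[13-14]. By multiplying the weighting vector by the decision matrix, the performance
matrix is given by Eq. (3):

$$
\begin{bmatrix}
\mu_R(PSS_{11}) \times w_i^{PSS} & \mu_R(D_{12}) \times w_i^{D} & \mu_R(L_{13}) \times w_i^{L} & \mu_R(C_{14}) \times w_i^{C} \\
\vdots & \vdots & \vdots & \vdots \\
\mu_R(PSS_{n1}) \times w_i^{PSS} & \mu_R(D_{n2}) \times w_i^{D} & \mu_R(L_{n3}) \times w_i^{L} & \mu_R(C_{n4}) \times w_i^{C}
\end{bmatrix}
\qquad (3)
$$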

Given the decision matrix and the weighting vector, the decision making objective for 
the general fuzzy MCDM problem is to rank all alternatives by giving each of them 
an overall preference rating with respect to all criteria [14]. GMV (Generalized Mean 
Value) is used for ranking the alternatives according to the aggregated criterion. The 
GMV for alternatives is represented as Eq. (4). 



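$$
m(P_i) = \frac{(C_i^2 + D_i^2 + C_i D_i) - (A_i^2 + B_i^2 + A_i B_i)}{3\,[(C_i + D_i) - (A_i + B_i)]} \qquad (4)
$$

where $A_i = \mu_R(PSS_{i1}) \times w_i^{PSS}$, $B_i = \mu_R(D_{i2}) \times w_i^{D}$,
$C_i = \mu_R(L_{i3}) \times w_i^{L}$, and $D_i = \mu_R(C_{i4}) \times w_i^{C}$,
respectively.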
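
The ranking step around Eq. (4) can be sketched as follows. The sketch assumes that
the GMV takes the standard centroid form for a trapezoidal fuzzy number built from
the four weighted membership values, that a larger GMV indicates a better candidate
block, and that a plain average is an acceptable fallback when the denominator
vanishes. These choices, the function names and the sample numbers are illustrative
assumptions.

```python
def gmv(a, b, c, d):
    """Generalized mean value (centroid) of the four weighted memberships."""
    denom = 3.0 * ((c + d) - (a + b))
    if denom == 0.0:
        # Assumed fallback for the degenerate case: use the plain average.
        return (a + b + c + d) / 4.0
    return ((c * c + d * d + c * d) - (a * a + b * b + a * b)) / denom


def rank_blocks(memberships, weights):
    """Rank candidate blocks by GMV, best first.

    memberships maps a block id to (mu_pss, mu_dist, mu_loc, mu_dir);
    weights maps the criterion names PSS, D, L and C to their weights.
    """
    scores = {}
    for block, (mu_pss, mu_dist, mu_loc, mu_dir) in memberships.items():
        a = mu_pss * weights["PSS"]
        b = mu_dist * weights["D"]
        c = mu_loc * weights["L"]
        d = mu_dir * weights["C"]
        scores[block] = gmv(a, b, c, d)
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    weights = {"PSS": 1.0, "D": 1.0, "L": 1.0, "C": 1.0}
    candidates = {"b1": (0.1, 0.2, 0.8, 0.9), "b2": (0.1, 0.2, 0.3, 0.4)}
    print(rank_blocks(candidates, weights))
```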



4 Performance Analysis 

The moving path and the mobile velocity are affected by the road topology. The
moving pattern is described by the changes in moving direction and velocity. In our
study we assume that low-speed mobiles (pedestrians) make up 60% of the total
population in the cell and high-speed mobiles (vehicles) 40%. One half of the
pedestrians are assumed to be still and the other half moving. Privately owned cars
make up 60% of all vehicles, taxis 10% and public transportation 30%. Vehicles move
forward, turn left or right, or make U-turns. The moving velocity is assumed to have
a uniform distribution. The walking speed of pedestrians is 0-5 km/h, the speed of
private cars and taxis is 30-100 km/h, and that of buses is 10-70 km/h. The speed is
assumed to be constant during walking or driving. Figure 4 shows the road used in our
simulation to consider traffic environments. The black circle indicates a branch of
the road, and the shaded areas are blocks that the road passes through. Each block is
a square with a side length of 30 m. The time needed for a high-speed mobile to pass
through a block is calculated from BT = r / v, where r is the length of the road
segment crossing each block and v is the mobile speed. As shown in Figure 4, BT is
dependent on r. We can consider four different values of r, namely $\sqrt{2}n$ m
(crossing diagonally), $\frac{3}{4}\sqrt{2}n$ m (3/4 crossing), $\frac{2}{4}\sqrt{2}n$ m
(2/4 crossing) and $\frac{1}{4}\sqrt{2}n$ m (1/4 crossing), according to which portion
of a block each road segment crosses. To make the simulation more realistic, it is
assumed that the signal strength is sampled every 0.5 s, 0.2 s, 0.1 s, 0.1 s and
0.05 s for speeds of < 10 km/h, < 20 km/h, < 50 km/h, < 70 km/h and < 100 km/h,
respectively. If BT is too small, we cannot obtain enough samples to
calculate the average signal strength. We consider the following simulation
parameters regarding the received signal strength. The value of ^ 2 , which indicates 
the changes in LOS/NLOS environments, is in the range of 20 through 50. The mean
signal attenuation by the path-loss is proportional to 3.5 times the propagation
distance, and the shadowing has a log-normal distribution with a standard deviation
of σ = 6 dB. A value of received signal strength less than -16 dB is regarded as an
error and is therefore excluded from the calculation.
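
To show how the block passage time BT = r / v limits the number of signal samples,
the following sketch combines the 30 m block side with the sampling intervals listed
above. It assumes that the crossing length is a fraction of the block diagonal and
that speeds of 100 km/h and above use the fastest listed rate; the function names
are hypothetical.

```python
import math

BLOCK_SIDE_M = 30.0  # side length of a square block, as stated in the text


def sampling_interval_s(speed_kmh):
    """Sampling interval for the speed bands listed in the text."""
    bands = ((10, 0.5), (20, 0.2), (50, 0.1), (70, 0.1), (100, 0.05))
    for upper_limit_kmh, interval_s in bands:
        if speed_kmh < upper_limit_kmh:
            return interval_s
    return 0.05  # assumption: reuse the fastest listed rate at 100 km/h and above


def samples_per_block(speed_kmh, crossing_fraction=1.0):
    """Return (BT in seconds, number of whole samples) for one block crossing.

    crossing_fraction is the assumed share of the block diagonal that the
    road segment covers (1.0, 0.75, 0.5 or 0.25).
    """
    if speed_kmh <= 0:
        raise ValueError("a stationary mobile never leaves the block")
    r = crossing_fraction * math.sqrt(2.0) * BLOCK_SIDE_M  # segment length in m
    v = speed_kmh / 3.6                                    # speed in m/s
    bt = r / v
    return bt, int(bt / sampling_interval_s(speed_kmh))


if __name__ == "__main__":
    for speed in (5, 36, 90):
        print(speed, samples_per_block(speed, crossing_fraction=0.25))
```

Short crossings at high speed leave only a handful of samples per block, which is
why a very small BT makes the averaged signal strength unreliable.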

To evaluate the error probability of our schemes in each estimation stage, the 
mobile population in the track boundary and sector boundary is generated according 
to a Poisson distribution. All the mobiles generated above are assumed to cross the 
sector boundary lines and track boundary lines. Also the curved path passes through 
the handoff area. Stationary mobiles appear at sector boundary areas and track 
boundary areas and remain still at those points. Pedestrian mobiles appear at sector 
and track boundary areas and move toward the neighboring sector or track. 

Figure 6 shows the effect of the block size on the estimation performance of MCDM 
and the existing schemes. As the block size becomes smaller, the accuracies of the 
three schemes decrease. The accuracies of AOA and TOA decrease rapidly. On the 
other hand, the performance of MCDM is least affected by block size because it 
additionally utilizes the previous location and distance between the mobile and the 
base station for estimation. 

Figure 7 shows the estimation accuracy of our proposed scheme as a function of the
mobile speed. The accuracy of AOA and TOA falls rapidly because the signal
measurement error grows as the mobile speed increases. The performance of MCDM is
least affected by the mobile speed because information such as the moving direction
and previous location is considered in MCDM and, therefore, errors during the signal
evaluation step decrease. We compare our scheme, MCDM, with VA [12], E-OTD and TDOA
in Figure 8. In this figure the mobile maintains its y position at 1000 m and
traverses the x axis from x = 0 m to x = 2000 m.
Fig. 6. The estimation accuracy versus the size

Fig. 7. The estimation accuracy versus the speed (horizontal axis: speed in km/h)


Distance Root Mean Square (DRMS) stands for the root-mean-square value of the
distances of the position fixes from the true location point in a collection of
measurements. In order to get the estimated values for comparison we take an average
of 20 values for each mobile position. We assume an NLOS environment in which the
signal level of the mobile may change abruptly due to shadowing. The figure shows
that the performance of MCDM is least affected by abrupt changes of signal level and
that MCDM has the most accurate result. This may well be attributed to the fact that
it places less weight on the received signal strength in NLOS areas and, instead,
greater weight on the other decision parameters, such as the distance between the
mobile and the base station, the previous location, and the moving direction.
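
The DRMS measure described above can be computed as in the following sketch, which
assumes 2-dimensional positions in metres; the function name and the sample
coordinates are illustrative.

```python
import math


def drms(true_position, fixes):
    """Root-mean-square distance of the position fixes from the true position."""
    tx, ty = true_position
    squared = [(x - tx) ** 2 + (y - ty) ** 2 for x, y in fixes]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Hypothetical fixes around a mobile that is really at (500 m, 1000 m).
    estimates = [(503.0, 1004.0), (497.0, 996.0), (500.0, 1005.0)]
    print(drms((500.0, 1000.0), estimates))
```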




Fig. 8. Comparison of the estimation accuracy 



5 Conclusions 

In this study, we proposed an MCDM-based mobile tracking method that estimates the
mobile location more accurately by considering multiple parameters, such as the
signal strength, the distance between the base station and the mobile, the moving
direction, and the previous location. We have demonstrated that our scheme increases
the estimation accuracy when the mobile moves along a boundary area. Further, we have
shown that the proposed scheme is little affected by an increased mobile speed or a
decreased block size. The effect of weight factor variations on the estimation
performance of our scheme and the determination of the optimal weight should be the
subject of a future study.
Abstract. Currently, the major focus of network security is securing
individual components as well as preventing unauthorized access to network
services. Ironically, Address Resolution Protocol (ARP) poisoning and
spoofing techniques can be used to prohibit unauthorized network access and
resource modifications. Protecting ARP, which relies on hosts caching reply
messages, can be the primary method for obstructing misuse of the network.
This paper proposes a network service access control framework, which
provides a comprehensive, host-by-host perspective on the security of IP
(Internet Protocol) over Ethernet networks. We also show how this framework
can be applied to network elements, including detecting, correcting, and
preventing security vulnerabilities.



1 Introduction 

Along with the development of communication networks, the problem of network
security has increasingly become a global challenge. Reflecting these trends,
the key focus of network security is securing individual components as well as
preventing unauthorized access to network services. Although IP over Ethernet
networks are the most popular Local Area Networks (LANs) nowadays, the neglect
of network security in the design of TCP/IP (Transmission Control Protocol and
Internet Protocol) has led to important network resources being wasted or
damaged.

Among the network resources, the IP address, a limited and important resource,
is increasingly misused, whether through inexperience or malice, causing
security problems or damaging entire networks. As an IP address uniquely
identifies a host, the same IP address cannot be used simultaneously by other
equipment. If IP addresses, which are set individually by hosts in the network,
are misused for inexperienced or malevolent purposes, security problems can be
triggered in the network.

Fig. 1. ARP Spoofing and Poisoning Attack

IP over Ethernet networks use ARP to resolve IP addresses into hardware, or
MAC (Media Access Control), addresses. All the hosts in the network maintain
an ARP cache which includes the IP addresses and the resolved hardware or MAC
addresses. ARP resolution is invoked when a new IP address has to be resolved
or when an entry in the cache expires. As shown in Fig. 1, an ARP poisoning
and spoofing attack can easily occur when a malicious user tries to modify the
association between an IP address and its corresponding hardware or MAC address
by disguising himself as an innocent host [1].

In this study, we propose an unauthorized network access control framework
for IP over Ethernet networks that guarantees fast and continuous network
protection. To this end, we propose a network access control scheme based on
ARP spoofing and demonstrate how this concept can be applied to network
elements, services, and applications. In addition, we demonstrate how the
security framework can be applied to all layers of the TCP/IP protocol suite.

The rest of this paper is organized as follows. The background relevant to
ARP operations and the details of the proposed framework are described in
Sections 2 and 3, respectively. Finally, the paper concludes in Section 4.



2 Network Security and ARP Operations 

2.1 Network Security 

Network security technologies have been studied to prevent increasingly varied
and sophisticated attacks on a network. Currently, they include an intrusion
detection system that detects signs of an attack, a firewall that mainly blocks
the traffic of a detected attacker, a response system, i.e., a packet filtering
router to protect its domain, and many other systems to enhance network
survivability.

Fig. 2. ARP Mechanism



The network survivability refers to continuing the operation of a system to 
provide services even though it has been damaged by network attacks, system 
failures, and other overloads. While early security technologies mainly covered 
screening for a single computer attack, the contemporary security technologies 
have been developed to resolve and resist those network attacks. That is, the 
network survivability has focused on systematically managing the configuration 
of the network and its components. 

IP address management refers to securing network survivability by monitoring
and disabling the function of a host when the system detects worms or other
abnormal behavior, as well as intentional or malicious changes of the IP
address. Thus, IP management and blocking the misuse of IP addresses come to
serve as a new concept in security solutions for controlling the network.

2.2 ARP Operations

The rest of this section briefly states how ARP operates. ARP provides a
mapping between the IPv4 address and the Ethernet address. When an Ethernet
frame is sent from one host to another, the 48-bit Ethernet address determines
the interface for which the frame is destined. When a host needs to send an IP
datagram as an Ethernet frame to another host whose MAC address it does not
know, it broadcasts a request for the MAC address associated with the IP
address of the destination. Every host on the subnet receives the request and
checks whether the IP address in the request is bound to one of its network
interfaces. If this is the case, the host with the matching IP address sends a
unicast reply to the sender of the request with the <IP address, MAC address>
pair. Every host maintains a table of <IP, MAC> pairs, called the ARP cache,
based on the replies it has received, in order to minimize the number of
requests sent on the network.

ARP is a stateless protocol, i.e., a reply may be processed even though the
corresponding request was never sent. When a host receives a reply, it updates
the corresponding entry in the cache. While a cache entry should be updated
only if the mapping is already present, some operating systems, e.g., Linux and
Windows, cache a reply in any case to optimize performance. Another stateless
feature of ARP is the so-called gratuitous ARP. A gratuitous ARP is a message
sent by a host requesting the MAC address for its own IP address. It is sent
either by a host that wishes to determine whether there is another host on the
LAN with the same IP address or by a host announcing that it has changed its
MAC address, thus allowing the other hosts to update their caches [1].
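
The effect of stateless reply processing can be illustrated with a toy cache model.
This is a simplified sketch and not an operating system implementation: the class
name, the `cache_unsolicited` flag and the addresses are hypothetical.

```python
class ArpCache:
    """Toy ARP cache that processes replies without tracking requests."""

    def __init__(self, cache_unsolicited=True):
        self.table = {}
        # True models hosts that cache any reply, even for unknown IP addresses.
        self.cache_unsolicited = cache_unsolicited

    def on_reply(self, ip, mac):
        """Apply an ARP reply and report whether the cache was changed."""
        if ip in self.table or self.cache_unsolicited:
            self.table[ip] = mac
            return True
        return False


if __name__ == "__main__":
    cache = ArpCache()
    cache.on_reply("192.168.0.1", "00:11:22:33:44:55")  # legitimate gateway
    cache.on_reply("192.168.0.1", "de:ad:be:ef:00:01")  # forged reply overwrites it
    print(cache.table)
```

Because no request state is kept, the forged reply replaces the legitimate mapping,
which is the weakness that ARP spoofing and poisoning exploit.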

The gratuitous ARP checks whether any other host is using its IP address when
a host first boots and brings up the network. A system that uses an
unauthorized IP address may therefore cause problems for other hosts that use
gratuitous ARP. For example, a server whose IP address has been taken by
another system while it was rebooting cannot use the network. That is, the IP
address may cause security problems that arise inside the network rather than
from outside [2] [3] [4].

3 Proposed Network Access Control Framework 

3.1 Gratuitous ARP 

Using the gratuitous ARP, a host can check whether an IP address is used by
other hosts in order to avoid using a duplicated IP address. Tables 1 through 4
show the different types of collisions in using gratuitous ARP for each OS
combination. The ON (Offending Node) denotes a host which tries to use the IP
address, and the DN (Defending Node) denotes a host already using the IP address.
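
As an illustration of the message involved, the sketch below builds the 28-byte ARP
payload of a gratuitous request in which the source and target IP addresses are both
the sender's own address, as in the Windows rows of the tables. The layout follows
the standard ARP packet format for Ethernet and IPv4; the function name and the
addresses are hypothetical, and the Ethernet header and actual transmission are
left out.

```python
import socket
import struct


def gratuitous_arp_payload(ip, mac):
    """Build the ARP payload of a gratuitous request for the given IP and MAC."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6,            # hardware address length
        4,            # protocol address length
        1,            # operation: request
        mac_bytes,    # source MAC
        ip_bytes,     # source IP
        b"\x00" * 6,  # target MAC, ignored by receivers
        ip_bytes,     # target IP equals the source IP
    )


if __name__ == "__main__":
    payload = gratuitous_arp_payload("192.168.0.10", "aa:bb:cc:dd:ee:ff")
    print(len(payload), payload.hex())
```

A defending node that already owns the address would answer such a request, which
is how the collision cases in the tables arise.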

3.2 An Unauthorized Access Control Framework 

Fig. 2 illustrates the algorithm that hosts follow when processing an ARP
message. While an ARP message is being processed, there can be many different
types of attacks

Table 1. Windows (A) to Windows (B) Gratuitous ARP

Type      Sender   Receiver  Source IP  Source MAC  Target IP  Target MAC
Request   B’s MAC  All       B          B’s MAC     B          Ignored
Response  A’s MAC  B’s MAC   A          A’s MAC     B          B’s MAC
Request   A’s MAC  All       A          A’s MAC     A          Ignored



Table 2. Windows (A) to Linux (B) Gratuitous ARP

Type      Sender   Receiver  Source IP  Source MAC  Target IP  Target MAC
Request   B’s MAC  All       B          B’s MAC     B          Ignored

Table 3. Linux (A) to Windows (B) Gratuitous ARP

Type      Sender   Receiver  Source IP  Source MAC  Target IP  Target MAC
Request   B’s MAC  All       0.0.0.0    B’s MAC     B          Ignored
Response  A’s MAC  B’s MAC   A          A’s MAC     0.0.0.0    B’s MAC



Table 4. Linux (A) to Linux (B) Gratuitous ARP

Type      Sender   Receiver  Source IP  Source MAC  Target IP  Target MAC
Request   B’s MAC  All       0.0.0.0    B’s MAC     B          Ignored
Response  A’s MAC  B’s MAC   A          A’s MAC     A          A’s MAC



Table 5. Vulnerable Points of ARP

Number  Problems                          Cause
1       Duplicated IP address             UNIX/Linux Server
2       ARP Cache Forgery                 MAC address Forgery
3       Authorized IP Blocking            Malicious gratuitous ARP Response to authorized host
4       Unauthorized IP Misappropriation  IP address alteration of unauthorized host



such as ARP spoofing, MAC flooding, ARP redirect, and MAC duplicating. The
types of attack that can occur at each step are as follows.

• <Step 1>: This stage identifies the hardware interface type and the upper
layer protocol. The host needs a verification function that checks whether
the protocol and the packet format are valid for the MAC layer access
protocol.

• <Step 2>: The host updates its current ARP cache based on the ARP request
message. If the hardware address and the protocol address already exist in
the cache table, the host only needs to update the entry. At this point, a
network security problem may occur if a third party broadcasts a packet
with an invalid MAC address. For example, if a host sends an ARP request
message carrying an invalid MAC address for host A, the hosts that were
communicating with host A will overwrite their cache entries with the
invalid MAC address. Host A would then be unable to use the network. The
attacker keeps generating packets with the invalid MAC address. In this
way, it can perform different forms of packet sniffing and related attacks
such as ARP table flooding, ARP spoofing, ARP redirect, and MAC
duplicating [5].

• <Step 3>: The host sends an ARP reply message in response to an ARP request
addressed to it. The problem at this step is that, if the host is rebooting
or is trying to change to an unused IP address, it will be unable to use
the network if a third party deliberately sends a fake reply message.
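
The three steps above can be sketched as a small in-memory model. This is a
simplified illustration, not the algorithm of Fig. 2 itself: the class and
field names are assumptions, the target-learning rule in Step 3 follows the
general ARP specification rather than the paper, and the addresses are
hypothetical. The example at the end shows how a forged request overwrites an
existing cache entry in Step 2.

```python
from dataclasses import dataclass
from typing import Optional

HW_ETHERNET = 1
PROTO_IPV4 = 0x0800
OP_REQUEST, OP_REPLY = 1, 2


@dataclass
class ArpMessage:
    hw_type: int
    proto_type: int
    op: int
    sender_ip: str
    sender_mac: str
    target_ip: str


class Host:
    def __init__(self, ip: str, mac: str) -> None:
        self.ip = ip
        self.mac = mac
        self.cache: dict[str, str] = {}  # IP address -> MAC address

    def process(self, msg: ArpMessage) -> Optional[ArpMessage]:
        # Step 1: check the hardware type and upper layer protocol.
        if msg.hw_type != HW_ETHERNET or msg.proto_type != PROTO_IPV4:
            return None
        # Step 2: an existing entry is overwritten without any verification.
        if msg.sender_ip in self.cache:
            self.cache[msg.sender_ip] = msg.sender_mac
        # Step 3: if this host is the target, learn the sender and answer requests.
        if msg.target_ip == self.ip:
            self.cache[msg.sender_ip] = msg.sender_mac
            if msg.op == OP_REQUEST:
                return ArpMessage(HW_ETHERNET, PROTO_IPV4, OP_REPLY,
                                  self.ip, self.mac, msg.sender_ip)
        return None


# Host C already knows the real MAC address of host A.
host_c = Host("10.0.0.3", "cc:cc:cc:cc:cc:cc")
host_c.cache["10.0.0.1"] = "aa:aa:aa:aa:aa:aa"

# A third party broadcasts a request that claims A's IP with an invalid MAC.
forged = ArpMessage(HW_ETHERNET, PROTO_IPV4, OP_REQUEST,
                    "10.0.0.1", "de:ad:be:ef:00:00", "10.0.0.9")
reply = host_c.process(forged)
poisoned_entry = host_c.cache["10.0.0.1"]
```

Because Step 2 performs no check on who sent the message, host C ends up with
the invalid MAC address for host A even though the request was not addressed
to it.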



Table 5 shows the security problems that can occur during ARP operation.
There are two types of solution to security problems 1 to 4: modifying the
ARP process, and managing IP and MAC addresses by monitoring ARP packets.
The first approach starts from the fact that IP and MAC addresses can be
changed by anyone at any time, so the protocol itself has no way of telling
who is a privileged user. However, it is not a perfect solution, because a
gratuitous ARP cannot be verified and the ARP packet cannot be altered.
The security problems can instead be solved by managing IP and MAC addresses
through the monitoring of ARP packets.
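
A minimal sketch of the monitoring approach, assuming the manager holds a
table of authorized <IP, MAC> pairs, could classify each observed ARP sender
as follows. The table contents, function name and verdict labels are
hypothetical and are not taken from the paper.

```python
# Hypothetical table of authorized <IP, MAC> pairs held by the manager.
AUTHORIZED_PAIRS = {
    "192.168.0.10": "00:11:22:33:44:55",
    "192.168.0.11": "00:11:22:33:44:66",
}


def classify_arp_sender(sender_ip: str, sender_mac: str,
                        authorized: dict[str, str] = AUTHORIZED_PAIRS) -> str:
    """Compare the sender fields of an observed ARP packet with the authorized table."""
    expected_mac = authorized.get(sender_ip)
    if expected_mac is None:
        return "unauthorized-ip"   # the IP address was never assigned
    if expected_mac.lower() != sender_mac.lower():
        return "mac-mismatch"      # an assigned IP is used from another MAC
    return "ok"


# Sender fields as an agent might extract them from captured ARP packets.
observed = [
    ("192.168.0.10", "00:11:22:33:44:55"),
    ("192.168.0.11", "de:ad:be:ef:00:00"),
    ("192.168.0.99", "00:aa:bb:cc:dd:ee"),
]
verdicts = [classify_arp_sender(ip, mac) for ip, mac in observed]
```

Any verdict other than "ok" would be a candidate for the blocking policy that
the manager enforces.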

As with IP spoofing, ARP spoofing prevents a host in the network from carrying
out normal network processing by preventing it from replying to ARP requests.
If the host tries to send an ARP reply, the attacker uses the IP address of
the disabled host and configures itself as the target host. When a victim host
tries to communicate with the disabled host, the attacker's system answers all
the broadcast ARP requests in place of the disabled host. Thus, the MAC
address of the victim system is stored in the attacker's ARP cache, and the
victim system mistakes the attacker for the disabled system and communicates
with the attacker as normal [1] [6] [7]. Ironically, the techniques of ARP
poisoning and spoofing can themselves be used to prohibit unauthorized network
access and resource modifications.

The distributed network environment covered in this study includes a manager
and an agent system. An agent is installed in each broadcast domain (including
each Virtual LAN) to collect the packets generated within the domain. The
manager enforces policies to block the unauthorized accesses detected by the
agent in the network. The agent uses the ARP spoofing technique to manage the
network. On orders from the manager, it also creates ARP packets to confirm
the up/down status of the network nodes and to obtain their MAC addresses, and
additionally to shut off network access for an unauthorized IP address. In
particular, the ARP Request is an important message because, through ARP
spoofing, it determines the ARP cache tables of all hosts in the network.
Fig. 3 shows the module structure of the manager and agent system, and Fig. 4
shows the process architecture of the agent system.

Fig. 3. The Module Architecture of Manager and Agent System

Fig. 4. The Process Module of Agent System

Table 6. ARP message process to block IP address

Type      Sender   Receiver   Source IP  Source MAC  Target IP  Target MAC
Request   Agent    All        B          Incorrect   B          Not used
Response  Blocked  Incorrect  B          B’s MAC     B          Incorrect



Table 7. ARP message process to interfere with the access of the blocked host

Type      Sender   Receiver   Source IP  Source MAC  Target IP  Target MAC
Request   Blocked  All        B          B’s MAC     C          Not used
Response  Common   Blocked    C          C’s MAC     B          Incorrect
Request   Agent    All        B          Incorrect   B          Not used
Response  Blocked  Incorrect  B          B’s MAC     B          Incorrect



Table 8. ARP message process to interfere with the access to the common host

Type      Sender   Receiver   Source IP  Source MAC  Target IP  Target MAC
Request   Common   All        C          C’s MAC     B          Not used
Response  Blocked  Common     B          B’s MAC     C          C’s MAC
Request   Agent    All        B          Incorrect   B          Not used
Response  Blocked  Incorrect  B          B’s MAC     B          Incorrect



3.3 Unauthorized Access Control Schemes 

Table 6 shows how to block the IP address. "Blocked" refers to the blocked
host (B) and "Common" refers to another common host (C) on the same network.
Host blocking and releasing include the processes that send and receive ARP
Request messages to select a host to be blocked or released and to confirm
its MAC address. In line 1, the Agent broadcasts an incorrect MAC address for
(B), so that the ARP cache tables of the other hosts, which contain the
address of the blocked host, are updated with the incorrect MAC address.

Table 7 shows the process in which the blocked host attempts to access other
hosts. If (B) sends an ARP Request message to request the MAC address of (C),
(C) will respond normally, which would allow the blocked host to communicate.
In this case, the Agent broadcasts an ARP Request message containing the
incorrect MAC address, to set an incorrect MAC address for (B) in the ARP
cache table of (C).

Table 8 shows the process in which the Agent (A) interferes with the access
of other hosts to (B). If (C) sends an ARP Request message to request the MAC
address of (B) in order to access (B), (A) sends the ARP Response message
containing the incorrect MAC address of the blocked host. (C) then holds the
incorrect MAC address of (B), because it updates its ARP cache table with the
request message most recently received from the Agent.

Table 9 shows how to release the blocked IP. The blocked IP is released when
(A) sends a gratuitous ARP packet for (B). Other hosts can then obtain the
correct MAC address of the blocked host, and can freely send and receive ARP
request/response messages without further interference from (A).

Table 9. ARP message process to release the blocked IP address

Type      Sender   Receiver  Source IP  Source MAC  Target IP  Target MAC
Request   Agent    All       B          B’s MAC     B          Not used
Response  Blocked  Agent     B          B’s MAC     B          B’s MAC
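
The blocking and release exchanges of Tables 6 and 9 can be modelled with a
small simulation of the hosts' ARP caches. This is only a sketch under the
assumption, noted earlier for Linux and Windows, that hosts cache what they
receive without verification; the class, the function names and the bogus MAC
address are hypothetical, and no real packets are sent.

```python
BOGUS_MAC = "02:00:00:00:00:01"  # hypothetical "incorrect" MAC used for blocking


class Lan:
    """In-memory model of one broadcast domain: each host has an ARP cache."""

    def __init__(self, hosts: dict[str, str]) -> None:
        self.real_macs = dict(hosts)                 # IP -> real MAC
        self.caches = {ip: {} for ip in hosts}       # IP -> that host's cache

    def broadcast_request(self, source_ip: str, source_mac: str) -> None:
        # Every other host stores the advertised pair without verifying it.
        for owner_ip, cache in self.caches.items():
            if owner_ip != source_ip:
                cache[source_ip] = source_mac


def block(lan: Lan, ip: str) -> None:
    # Table 6, line 1: the agent advertises an incorrect MAC for the blocked IP.
    lan.broadcast_request(ip, BOGUS_MAC)


def release(lan: Lan, ip: str) -> None:
    # Table 9, line 1: the agent advertises the correct MAC again.
    lan.broadcast_request(ip, lan.real_macs[ip])


host_b, host_c = "10.0.0.2", "10.0.0.3"
lan = Lan({host_b: "bb:bb:bb:bb:bb:bb", host_c: "cc:cc:cc:cc:cc:cc"})

block(lan, host_b)
blocked_view = lan.caches[host_c][host_b]    # what C believes while B is blocked

release(lan, host_b)
released_view = lan.caches[host_c][host_b]   # what C believes after the release
```

While (B) is blocked, frames that (C) addresses to (B) go to a MAC address
that no host owns; after the release, (C) holds the real address again.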

4 Conclusions 

The IP address, a limited and important resource, is increasingly misused,
whether through inexperience or malicious intent, causing security problems
or damaging entire networks. Because a system that uses an unauthorized IP
address may cause problems for other hosts, IP address misuse may cause
security problems that originate inside the network rather than outside it.

In this paper, we proposed an unauthorized network service access control
framework focusing on the management and security of the IP address as a
network resource. The system, consisting of agents and a manager, uses network
monitoring and an IP blocking algorithm to integrate networks so as to manage
IP resources effectively. The agent can be extended toward integrated IP
management by installing a Simple Network Management Protocol (SNMP) agent.

This system also presents the possibility of developing an integrated
management system that protects the network from external virus attacks. This
study dealt with a system operating in an IPv4 environment; a comparable
system will be needed under IPv6, which is expected to gain popularity. The
same network blocking mechanism as in the IPv4 network can optionally be
operated on Internet Control Message Protocol version 6 (ICMPv6).

Abstract. This study proposes a network traffic monitoring system that
supports the operation, management, expansion and design of a network for its
users through analysis and diagnosis of the network-related equipment and
lines in a PC-room. The proposed monitoring system is lightweight so that it
can be used in a wireless environment, and it applies web-based technology
using JAVA to overcome the limits on managerial space, the inconvenience and
the platform dependency of existing tools, and to raise its applicability and
usability on real networks based on performance and fault analyses and their
calculation algorithms. The traffic monitoring system implemented in this
study allows users to carry out network management effectively through the
web, including network diagnosis, fault detection, network configuration and
design, and helps users with their management by presenting a simple
application of network management.



1 Introduction 

With the remarkable development of the Internet, the number of network users
has increased rapidly, leading to an explosive increase in network traffic in
many companies, schools and public institutions. Along with the development of
network technologies and the use of various applications, network traffic
includes not only data but also voice, picture, image and multimedia traffic.
These increases in network users and traffic have raised the need for
higher-capacity network lines and the resulting equipment investments, and
have made network configurations larger and more complicated [1-2]. However,
such trends exceed the managerial scope of a manager, increasing the need for
performance and fault management on the network. Various management tools have
accordingly been developed to help managers. Since those tools, however, had
fundamental limits such as limited management functions, inconvenience,
insufficient scalability to large-scale networks and difficulty in applying
the analytic results, managers could carry out their management only in a
restricted way [3]. The solution to the problems in the existing management
technologies and tools is being pursued by applying new Internet-based
technologies, such as web-related technologies or JAVA, to fields such as the
management of networks, systems or applications [4-5]. This approach is called
web-based management technology, by which the limits on managerial space and
the inconveniences can be overcome by moving management to the web platform,
increasing efficiency. Web-based network management products typically include
MRTG (Multi Router Traffic Grapher), which collects management information
through SNMP, stores traffic data as GIF images and outputs the results in the
form of HTML containing the GIF files; N-Vision, developed through the JAVA
interface of the HP OpenView management platform; IntraSpection, using a JAVA
SNMP applet; EnterprisePRO, with WAN and LAN management functions; ANACAPA
SOFTWARE, which tracks and reports user response time; and HP NetMetrix/UX
Reporter [6-7]. Although attempts have been made to apply network management
technology to the web platform, many tools are intended for WAN management,
or, for LAN management, monitor segments and the communications of all nodes
within those segments without including a function to analyze the management
information. In addition, they provide static information without processing
the extracted management information, which makes the analytic results
difficult to interpret, and they do not provide cumulative analysis of
long-term traffic statistics and trends through the accumulation of management
information [8-9].

In this study, RMON MIB, an extension of the standard MIB-II, is applied
together with web-based management technology to solve these problems. The
study analyzes the RMON MIB and MIB-II objects suitable for network
management, derives the relevant MIB objects, and defines the network
performance and fault analyses that are significant from the manager's point
of view. It also applies JAVA and web-related technology to network management
and implements a web-based network traffic monitoring system that solves the
problems of the existing management tools and is lightweight for convenient
use. Finally, it covers a simple application of network management to help
managers optimize their management.



2 Design and Implementation of Traffic Monitoring System 

The overall structure of the network traffic monitoring system proposed in
this study is shown in Fig. 1. The system consists of an analysis server,
which collects management information by monitoring the network activities of
the managed systems and analyzes the results, and a client system, which
presents graphic data to make the analysis results easier to apply.

The monitoring system comprises an Internet server, an intranet server and a
database on the web side; the HTML documents and JAVA byte codes on the web
server are transferred to the client for execution. The client is implemented
as an applet; it connects to the server over a new connection as requested by
the manager and presents the answers to the user in graphic form. Messages are
received and transferred in a message format defined in MATP (Management
Application Transfer Protocol) [10].

For real-time analysis of each analysis item requested by the client, the
server responds to real-time requests by collecting and analyzing the
information and returning the results in real time. For cumulative analysis,
the server queries the data stored in the database to answer the request from
the client.

Fig. 1. The whole structure of network traffic monitoring system



2.1 Client System 

The overall structure of the client system is shown in Fig. 2. The client
system runs in the user's web browser and comprises the following functions:
a user entry interface for management requests, real-time monitoring in which
responses to the requests are displayed, information collection
requests/suspensions, accumulative analysis monitoring, and graphic output of
the traffic results received from the analysis server. It also shows a simple
application of network management.




Fig. 2. The whole structure of client system 





A Design and Implementation of Network Traffic Monitoring System 






Function of real-time analysis monitoring. A user must enter the equipment
name of the managed system, IP address, port number, community, network speed
and polling time to analyze the current traffic and fault conditions of the
LAN. Specifically, the name and IP address are used to identify the analyzed
equipment when the results are output. The real-time monitoring function is a
user interface that receives a request from a user to analyze the current
traffic conditions of the LAN and shows the analytic results as graphic
output in real time. Table 1 shows the analysis items of the real-time
monitoring function:



Table 1. Items of analyzing real-time monitoring

type      analysis items        contents
internet  network utilization   rate of utilization of network per unit time
          input/output traffic  amount of input or output traffic
          network error         rate of error packet
          packet loss           rate of input and output error packet
          packet analysis       amount of broadcast packet
intranet  segment utilization   rate of segment utilization in use
          segment collision     rate of segment collision
          segment error         rate of segment error packet
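
The paper does not give the formulas behind the items in Table 1. As a hedged
illustration, a commonly used way to derive utilization and error rate from
two successive samples of interface counters (for example octet and packet
counters polled through SNMP) is sketched below; the function names and the
32-bit counter assumption are choices made for this sketch.

```python
COUNTER32_MODULUS = 2 ** 32  # assumes 32-bit counters that wrap around


def counter_delta(previous: int, current: int,
                  modulus: int = COUNTER32_MODULUS) -> int:
    """Difference between two counter samples, tolerating a single wrap-around."""
    return (current - previous) % modulus


def utilization_percent(prev_octets: int, curr_octets: int,
                        interval_s: float, speed_bps: float) -> float:
    """Share of the line capacity used during one polling interval, in percent."""
    bits = counter_delta(prev_octets, curr_octets) * 8
    return 100.0 * bits / (interval_s * speed_bps)


def error_rate_percent(delta_errors: int, delta_packets: int) -> float:
    """Share of packets counted as errors during one polling interval, in percent."""
    if delta_packets == 0:
        return 0.0
    return 100.0 * delta_errors / delta_packets


# 1,250,000 octets in 10 seconds on a 10 Mbit/s line.
example_utilization = utilization_percent(0, 1_250_000, 10, 10_000_000)
```

The network speed and polling time that the user enters correspond to
`speed_bps` and `interval_s` in this sketch.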



Function of information collection requests/suspensions. A user needs to
collect traffic statistics on a segment and analyze their flows and trends in
order to carry out managerial activities such as improving network performance
and design and diagnosing faults. In response to a collection request from a
user, traffic management information is collected periodically and saved in
the database until a suspension request is received.

Function of accumulative analysis monitoring. For accumulative analysis
monitoring, the user enters the Request ID, IP address and polling time to
analyze the traffic and fault conditions during a specified period.
Accumulative analysis shows the user, in the form of a graph, the analytic
results of the traffic data collected in response to a collection request. The
analysis items of accumulative analysis monitoring are the same as those of
real-time monitoring shown in Table 1.
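
A minimal sketch of how collected samples might be stored and then summarized
for an accumulative analysis is given below, using an in-memory SQLite
database. The table layout, column names and function names are assumptions
made for illustration; the paper does not describe its database schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples ("
    "request_id INTEGER, ts INTEGER, utilization REAL, error_rate REAL)"
)


def save_sample(conn, request_id, ts, utilization, error_rate):
    """Store one polled sample that belongs to a collection request."""
    conn.execute("INSERT INTO samples VALUES (?, ?, ?, ?)",
                 (request_id, ts, utilization, error_rate))


def accumulate(conn, request_id, start_ts, end_ts):
    """Summarize the samples of one request over the period [start_ts, end_ts]."""
    row = conn.execute(
        "SELECT COUNT(*), AVG(utilization), MAX(utilization), AVG(error_rate) "
        "FROM samples WHERE request_id = ? AND ts BETWEEN ? AND ?",
        (request_id, start_ts, end_ts),
    ).fetchone()
    return {"samples": row[0], "avg_utilization": row[1],
            "max_utilization": row[2], "avg_error_rate": row[3]}


# Hypothetical samples: three for request 1, one for request 2.
save_sample(conn, 1, 0, 10.0, 0.5)
save_sample(conn, 1, 60, 30.0, 1.5)
save_sample(conn, 1, 120, 20.0, 1.0)
save_sample(conn, 2, 60, 90.0, 5.0)

summary = accumulate(conn, 1, 0, 120)
```

The Request ID entered by the user selects which collection the summary is
computed for, so samples of other requests do not affect the result.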

Function of outputting result graph & application example. In this function,
the analytic results for real-time monitoring requests and accumulative
analysis requests are output to the user in the form of graphs. The graphic
outputs include line graphs, bar graphs and pie graphs, as appropriate for the
results of the LAN performance and fault analyses. Users can also make use of
a simple application example based on the results.

2.2 Analysis Server

The analysis server runs as a daemon process because it must provide services
on a specific port to respond to requests from the client system. The server
is broadly divided into an Internet server and an Intranet server.

Internet server. The Internet server processes responses to analysis requests
from the client, that is, the user interface; this covers connection setup,
message generation, and data processing and transfer for each analysis item.
It is installed and operated on the machine where the web server is installed.
On receiving an analysis request message, the analysis module passes it to the
analyzing item processor, acquires management information by calling the SNMP
manager implemented in JAVA to poll the management information relevant to
each analysis item, and generates the analyzed information and transfers it to
the client system in real time.

SNMP Manager. The SNMP manager polls the relevant MIB information so that the
Internet server can derive analysis information in response to a real-time
analysis request message.

Message Processing Module. The Message Processing Module processes the
received message according to the mode requested by the user: it analyzes the
request message and passes it to the relevant processing module. The modules
that interact with it, the analysis module and the graph generator module,
provide the user with the processed results for the requested analysis item in
the form of a graph.

Analysis Module. It processes real-time responses to a user's request to
analyze the current Internet status. This module calls the SNMP manager to
poll the relevant MIB information, derives the current value of the analysis
item from the client's request message and the polled data, delivers the
obtained information to the analyzing item processor, and transfers the
analysis information to the client at every poll.

Analyzing Item Processor. It derives the analyzed results from the management
information polled for each analysis item by the analysis module. The SNMP
manager collects the polled management information from network devices at the
request of the real-time analysis module; the processor calculates the
analyzed results at a specific point in time, using different analysis methods
according to the type of each analysis item, and delivers them to the
real-time analysis module.



Intranet server. Each module in the intranet server has its own function for
LAN analysis: user request control, RMON setting module, RMON check module,
and analysis management module.

User Request Control. It receives and analyzes a message transferred from the
client and hands it over so that the appropriate request is processed. It
saves the message from the client in the message structure and then analyzes
its header to transfer control to the module corresponding to each analysis
item.
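
The header-based hand-over performed by User Request Control can be
illustrated with a simple dispatch table. The message fields, mode and item
names, and handler functions below are hypothetical; the actual MATP message
format is not given in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    mode: str                      # hypothetical header field, e.g. "realtime"
    item: str                      # hypothetical header field naming the analysis item
    payload: dict = field(default_factory=dict)


def handle_segment_utilization(payload: dict) -> str:
    return f"segment utilization for {payload['ip']}"


def handle_segment_collision(payload: dict) -> str:
    return f"segment collision for {payload['ip']}"


# Header values mapped to the module that handles them.
HANDLERS = {
    ("realtime", "segment_utilization"): handle_segment_utilization,
    ("realtime", "segment_collision"): handle_segment_collision,
}


def dispatch(message: Message) -> str:
    """Inspect the header and pass control to the module for that analysis item."""
    handler = HANDLERS.get((message.mode, message.item))
    if handler is None:
        raise ValueError(f"no module for {message.mode}/{message.item}")
    return handler(message.payload)


result = dispatch(Message("realtime", "segment_utilization", {"ip": "10.0.0.5"}))
```

Keeping the mapping in one table means a new analysis item only requires a new
handler and one additional entry.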

RMON Setting Module. This module is an RMON control module that enables or
disables RMON, controlling RMON wherever it is required according to the
analysis items.

RMON Check Module. This module inspects the managed system, that is, the RMON
device named in a user's request: it checks the function of the RMON probe and
investigates whether there is any fault in collecting management information.
It polls each of the managed systems named in users' requests to identify
their functions and status.

Analysis Management Module. This module calculates the analyzed results from
the management information files, collected in real time or accumulated, and
transfers them to the client module.

The proposed system comprises the client system and the analysis server. To
exchange a user's request and its response within the system, each component
requires a corresponding message exchange procedure.

When a user issues a management request, the client system with which the user
interacts sets up a TCP connection to the analysis server and transfers the
request message; the requested data may be transferred to the server together
with the request message, and the server returns an ACK to confirm receipt.
The server that received the request from the client then connects to the
RMON Agent and starts polling. The RMON Agent returns the processed result to
the server, which returns it to the client in a response message.



3 Results & Analysis 

The network traffic monitoring system provides real-time analysis, collection
request/suspension and accumulative analysis functions according to the
user's request. To process these functions on the web, the client system and
the analysis server each contain the relevant processing modules. Fig. 3 shows
the tree-structured interface of the client system and the item settings
required to process each analysis item. The same interface structure is used
for setting every item for the convenience of users, so that only the items
required by the selected analysis item need to be set.

Fig. 3. User interface in client system

In the real-time analysis function, the current usage and fault conditions of
the LAN are analyzed and presented in a dynamic graphic view to help the user
understand the network diagnosis easily. The result display of real-time
monitoring is shown in Fig. 4, which gives examples of graphs showing the line
availability rate of the LAN in real time and the input and output traffic
rates. The display also helps optimize management by presenting a simple
application to network management.




Fig. 4. Results of real-time analysis 

In the collection request/suspension function, the LAN is monitored in
response to requests for the collection of management information, in order to
accumulate and analyze the management information of the LAN. To use this
function, a user must enter the RMON agent IP address, community, segment
speed and management information of the managed segment, based on which the
analysis server collects the management information. When a collection request
or suspension is carried out, the relevant message is displayed to the user.

The accumulative analysis function analyzes the basic management
considerations on the web. This function is implemented to help the user
understand the analyzed results, calculated from the information accumulated
during a certain period, through graphic views of various types. For the
collection period, a user can compare pieces of information such as line
availability rate, error rate and collision rate through various graphic views
to identify abnormalities in the LAN. Fig. 5 shows the result of this
function.








Fig. 5. Results of accumulative analysis 

The proposed system also makes it possible to manage a PC-room without expert
skills by offering a variety of application examples. That is, a manager can
check the management status from the current information and a simple
application shown for each item. Fig. 6 shows the result of Internet
information analyses, which can help managers perform network management.

Fig. 6. Results of internet/intranet information analysis

4 Conclusions

In this study, we proposed a web-based network traffic monitoring system that
aims to eliminate the constraints on management while providing more
user-friendly management tools. The proposed network traffic monitoring system
is composed of a client system and a server system to provide management
efficiency and a distributed management function. The client system is
implemented with JAVA and web-related technology to provide a graphic function
that presents the user interface and the analyzed results clearly and
dynamically on the web; the client provides functions such as real-time
analysis, collection request/suspension and cumulative analysis, through which
requests from a user are received. The analysis server analyzes and processes
the user requests transferred from the client and returns the results to the
client. Accordingly, the analysis server is designed so that requests can be
processed simultaneously through threads.

The network traffic monitoring system proposed in this study diagnoses the
quality and condition of the network from the network manager's point of view,
to provide the management information that serves as a measure for optimal
performance, failure recovery and network configuration. It is therefore
expected to help carry out management effectively on complicated LANs that
managers have difficulty handling.

Abstract. Open Vulnerability Assessment Language (OVAL) is a standard language
used to detect the vulnerabilities of a local system based on the system
characteristics and configurations. It was proposed by MITRE. OVAL consists of
XML schemas and SQL query statements. The XML schema defines the vulnerable
points and the SQL queries detect the vulnerable and weak points. This paper
designs and implements a vulnerability assessment tool with OVAL to detect the
weak points in a Linux system. It offers better readability, reliability,
scalability and simplicity than traditional tools.



1 Introduction 

The vulnerability assessment tool is a security tool that diagnoses a computer
system and detects its weaknesses in advance to keep the system safe.
Vulnerability assessment tools are broadly classified into host-based
assessment tools, network-based assessment tools, and application assessment
tools that detect how attackable specific applications are. Existing
vulnerability assessment tools detect a system's weaknesses by executing
attack code such as exploit scripts [1]. However, individual tools do not
share common criteria for vulnerability detection, and vulnerability
assessment scripts are implemented in various programming languages. It is
therefore difficult to know which tools provide more correct diagnoses, and
the cost of developing and maintaining the assessment scripts rises. MITRE
proposed OVAL (Open Vulnerability Assessment Language) to overcome these
limitations. OVAL is a standard language for assessing the vulnerability of a
local system based on information about the system's characteristics and
configuration. Basically, OVAL defines the weaknesses listed in CVE with XML
schemas. Using these XML schemas, it constructs and executes query statements
to detect the weak points.

This paper designs a host-based vulnerability assessment tool for the RedHat
Linux system with OVAL, which has been proposed by MITRE. In Section 2,
related work, we analyze the existing assessment tools and compare them with
OVAL.

2 Related Work

2.1 The Vulnerability Assessment Tool

The vulnerability assessment tool is a kind of security tool that keeps
systems safer by diagnosing the weak points in computer systems in advance and
providing solutions and the proper patch information. It is also called a
vulnerability scanner or security scanner. These scanners are classified into
host scanners and network scanners according to what they check [2]. A host
scanner is installed on each operator's platform. It searches for security
problems that can be caused by the administrator's mistakes or
misconfigurations [3]. A network scanner assesses the potential weak points
that can be attacked by external hackers.

A vulnerability scanner usually uses detection scripts such as exploits to
find weak points. However, the commercial and free tools currently in use have
the problem that their detection results are not reliable, because they apply
different criteria to vulnerability assessment and their scripts are written
in a wide variety of description languages. Table 1 shows free vulnerability
assessment tools and the languages used in their detection scripts.



Table 1. Free Vulnerability Assessment Tools and Their Languages

Name of Tools  Type             Used Languages
Tiger          Host scanner     C, Shell Script
COPS           Host scanner     C, Shell Script
Nessus         Network scanner  NASL
SARA           Network scanner  C, Perl
SAINT          Network scanner  C, Perl
Sscan          Network scanner  C
Vlad           Network scanner  Perl



2.2 OVAL 

OVAL is the common language for security experts to discuss and agree upon technical
details about how to check for the presence of vulnerabilities on a computer system.
The end results of the discussions are OVAL queries, which perform the checks to
identify the vulnerabilities [1, 4].

OVAL queries are written in SQL and use a collaboratively developed and standardized
SQL schema as the basis for each query. OVAL queries detect the presence of software
vulnerabilities in terms of system characteristics and configuration information,
without requiring software exploit code. By specifying logical conditions on the
values of system characteristics and configuration attributes, OVAL queries
characterize exactly which systems are susceptible to a given vulnerability.
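
To illustrate the idea, the following minimal sketch expresses a vulnerability check
as a logical condition on stored system characteristics. The table rpm_info, its
columns, the package name, and the identifier CAN-0000-0000 are hypothetical
placeholders chosen for this example; they are not taken from the official OVAL
schema or from any real CVE entry.

```python
import sqlite3

# Hypothetical, simplified table of installed packages (not the official OVAL schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rpm_info (name TEXT, version TEXT)")
conn.executemany(
    "INSERT INTO rpm_info VALUES (?, ?)",
    [("examplepkg", "1.0"), ("otherpkg", "2.3")],
)

# An OVAL-style query: it returns an identifier only when the logical
# condition on the system characteristics holds. No exploit code is run.
QUERY = """
SELECT DISTINCT 'CAN-0000-0000'
FROM rpm_info
WHERE name = 'examplepkg' AND version IN ('1.0', '1.1')
"""

detected = [row[0] for row in conn.execute(QUERY)]
print(detected)
```

If the package is absent or has a different version, the query returns no rows and
nothing is reported.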

OVAL queries are based primarily on the known vulnerabilities identified in Common
Vulnerabilities and Exposures (CVE), a dictionary of standardized names and
descriptions for publicly known information security vulnerabilities and exposures
developed by The MITRE Corporation in cooperation with the international security
community. CVE common names make it easier to share data across separate network
security databases and tools that are CVE-compatible. CVE also provides a baseline
for evaluating the coverage of an organization's security tools, including the
security advisories it receives. For each CVE name, there are one or more OVAL
queries. Fig. 1 shows the operational procedure of the assessment tool based on
OVAL.




[Figure: block diagram with the labels CVE (standardized vulnerability names), OVAL
(standardized vulnerability definitions), Security Assessment Tools, System, Reports,
and Secured System.]

Fig. 1. Operational Procedures of OVAL



2.3 OVAL Schema 

XML and SQL have the strength of defining vulnerable points logically and clearly.
Both languages can be processed by computer systems and are readable to security
experts. The purpose of the XML schema is to define vulnerabilities in the system; it
consists of the common schema and the per-platform schema. The common schema
describes the fundamental information required to define vulnerabilities, and the
per-platform schema describes the operational elements to check on each platform.



3 Design of Vulnerabilities Assessment Tool 

3.1 Overall Structure 

In this paper, we designed a vulnerability assessment tool for the RedHat Linux
platform using the OVAL schema suggested by MITRE. Its overall structure is shown in
Fig. 2.

The Data File consists of an “INSERT DATA” part and an “OVAL queries” part. The
“INSERT DATA” part lists the data to be collected by the “System Information
Collecting Module.” The “OVAL queries” part describes, in the form of SQL query
statements, the conditions that the “Query Interpreter” module applies to the
collected system information to detect the system’s vulnerabilities.



[Figure: block diagram. The Data File (Insert Data + OVAL Queries) is at the top;
below it are the Data File Verification Module and the Log Management Module, then
the System Information Collecting Module, the Query Interpreter, and the Reporting
Module; these rest on the Database System (SQLite), which runs on the OS Platform
(Red Hat Linux).]

Fig. 2. Overall Structure of Vulnerability Assessment Tool

The “Data File Verification Module” verifies whether the given data file is correct.
The “Log Management Module” handles the errors that can occur in the system. The
“System Information Collecting Module” has two roles in the vulnerability assessment
tool. One role is to collect various system information, such as configuration
settings, software installation information, file information, and process
information, based on “INSERT DATA.” The other role is to update the database based
on the collected data. Because the “OVAL Queries” part is written in SQL, an
OVAL-based vulnerability assessment system must contain a DBMS (Database Management
System). In our design, we used SQLite as the DBMS. It is file-based. The general
characteristics of SQLite are summarized in Table 2.



Table 2. Characteristics of SQLite

Characteristic             Description
SQL compatibility          Supports almost all SQL92 syntax
Speed                      Two times faster than conventional DBMSs in general
                           command processing
Size                       About 25K lines of C code; a very lightweight DBMS
Database                   The whole database is contained in one file
Operational environment    Can be executed without the help of other libraries



3.2 Database Design: Schema

System data gathered by the “System Information Collecting Module” is stored in the
database. OVAL query statements are applied to this database to find the
corresponding vulnerabilities. The tables of the database are constructed using the
OVAL schema of the individual OS platform. For the RedHat-series Linux platform, we
designed the required schemas: File Metadata Schema, Internet Server Daemon Schema,
Passwd File Schema, Shadow File Schema, Process Information Schema, RPM Information
Schema, Information for Comparison with Susceptible RPM Versions Schema, and
Operating Platform Information Schema. As an example, the File Metadata Schema is
shown in Table 3.



Table 3. File Metadata Schema

Field              Description
FilePath           Absolute path of a file
FileType           Directory, normal file, device file, etc.
UserID             Owner ID of a file
GroupID            Group ID of a file
time               Access time (Atime), Status Change Time (Ctime), Data Update
                   Time (Mtime)
MD5                MD5 hash for a file
Permission bits    SUID, SGID, STICKY, UREAD, UWRITE, UEXEC, GREAD, GWRITE, GEXEC,
                   OREAD, OWRITE, OEXEC
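
As a rough illustration, the fields of Table 3 could be mapped onto an SQLite table
as sketched below. The field names follow Table 3, but the table name file_metadata,
the SQL types, the split of the time field into three columns, and the choice of
FilePath as the key are assumptions made for this example, not the schema actually
used by the tool.

```python
import sqlite3

# Column names follow Table 3; the SQL types and the table name are assumptions.
FILE_METADATA_DDL = """
CREATE TABLE file_metadata (
    FilePath TEXT PRIMARY KEY,  -- absolute path of a file
    FileType TEXT,              -- directory, normal file, device file, etc.
    UserID   INTEGER,           -- owner ID of a file
    GroupID  INTEGER,           -- group ID of a file
    ATime    INTEGER,           -- access time
    CTime    INTEGER,           -- status change time
    MTime    INTEGER,           -- data update time
    MD5      TEXT,              -- MD5 hash for a file
    SUID INTEGER, SGID INTEGER, STICKY INTEGER,
    UREAD INTEGER, UWRITE INTEGER, UEXEC INTEGER,
    GREAD INTEGER, GWRITE INTEGER, GEXEC INTEGER,
    OREAD INTEGER, OWRITE INTEGER, OEXEC INTEGER
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(FILE_METADATA_DDL)

# List the column names SQLite actually created.
columns = [row[1] for row in conn.execute("PRAGMA table_info(file_metadata)")]
print(columns)
```

Storing each permission bit in its own column lets a query test, for example, for
world-writable SUID files with a plain WHERE clause.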



3.3 Construction of System Information Base 

The “System Information Collecting Module” plays two roles: 1) collecting the system
information required to assess the vulnerabilities in the system, and 2) reflecting
that information in the database designed in subsection 3.2. The data that this
module should collect is listed in the “INSERT DATA” part in Fig. 2. OVAL uses this
“INSERT DATA” part to reduce the time needed to collect the required system
information. In other words, the “INSERT DATA” part lists not all the information
about installed packages and files, but only the items required to assess the
vulnerability of the system. The “System Information Collecting Module” consists of
8 sub-modules, named after the corresponding schemas: the File Metadata Collecting
Sub-module, Internet Server Daemon Information Collecting Sub-module, RPM
Information Collecting Sub-module, RPM Version Comparison Sub-module, Password File
Information Collecting Sub-module, Process Information Collecting Sub-module, Shadow
File Information Collecting Sub-module, and Operating Platform Information
Collecting Sub-module.
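
The two roles described above can be sketched for the file metadata case as follows.
This is only an illustration under stated assumptions: the function
collect_file_metadata, the reduced column set, and the use of a plain list of paths
to stand in for the “INSERT DATA” part are all hypothetical, and the actual tool may
work differently.

```python
import hashlib
import os
import sqlite3
import stat
import tempfile


def collect_file_metadata(conn, paths):
    """Stat each listed path and store a few of its attributes."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_metadata "
        "(FilePath TEXT PRIMARY KEY, UserID INTEGER, GroupID INTEGER, "
        "MTime INTEGER, MD5 TEXT, SUID INTEGER, OWRITE INTEGER)"
    )
    for path in paths:
        st = os.stat(path)
        with open(path, "rb") as handle:
            digest = hashlib.md5(handle.read()).hexdigest()
        conn.execute(
            "INSERT OR REPLACE INTO file_metadata VALUES (?, ?, ?, ?, ?, ?, ?)",
            (
                os.path.abspath(path),
                st.st_uid,
                st.st_gid,
                int(st.st_mtime),
                digest,
                int(bool(st.st_mode & stat.S_ISUID)),  # set-user-ID bit
                int(bool(st.st_mode & stat.S_IWOTH)),  # writable by others
            ),
        )
    conn.commit()


# Usage: only the files named in the (hypothetical) INSERT DATA list are probed.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example")
insert_data = [tmp.name]

conn = sqlite3.connect(":memory:")
collect_file_metadata(conn, insert_data)
rows = conn.execute("SELECT FilePath, MD5 FROM file_metadata").fetchall()
os.unlink(tmp.name)
print(rows)
```

Probing only the listed paths, rather than walking the whole file system, is what
keeps the collection time short.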

Fig. 3 shows the File Metadata Table, one of the tables produced by the operation of
the “System Information Collecting Module.” Like this table, the other information
required to assess the system is collected in the form of SQLite database tables.

Fig. 3. File Metadata Table produced by the “System Information Collecting Module”



3.4 Execution of Query Interpreter Module 

The “Query Interpreter” detects vulnerabilities in the system by applying the OVAL
queries stored in the “Data File” to the system information stored in the SQLite
database. If a vulnerability is detected, the return value of the OVAL query
statement is its CVE ID. In that case, the “Reporting Module” reports the
susceptibility on the operator’s screen. Fig. 4 is an example of the report output
when the “Query Interpreter” detects vulnerabilities in the system. In this example,
the detected CVE IDs are CAN-2003-0165, CAN-2003-0547, and so on.
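
The interplay of the “Query Interpreter” and the “Reporting Module” can be sketched
as below. The function names, the rpm_info table, and the identifiers CAN-0000-0001
and CAN-0000-0002 are hypothetical placeholders; the sketch only assumes, as stated
above, that a query returns an identifier when its condition holds.

```python
import sqlite3


def run_oval_queries(conn, queries):
    """Run each query; any returned row is taken as a detected identifier."""
    detected = []
    for sql in queries:
        for (cve_id,) in conn.execute(sql):
            detected.append(cve_id)
    return detected


def report(detected):
    """Format one line per detected identifier."""
    return ["Vulnerability is found: " + cve_id for cve_id in detected]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rpm_info (name TEXT, version TEXT)")
conn.execute("INSERT INTO rpm_info VALUES ('examplepkg', '1.0')")

queries = [
    # Condition holds on this system, so the identifier is returned.
    "SELECT DISTINCT 'CAN-0000-0001' FROM rpm_info "
    "WHERE name = 'examplepkg' AND version = '1.0'",
    # Condition does not hold, so no row is returned.
    "SELECT DISTINCT 'CAN-0000-0002' FROM rpm_info "
    "WHERE name = 'absentpkg'",
]

lines = report(run_oval_queries(conn, queries))
print("\n".join(lines))
```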



[Figure: terminal screenshot of the tool started with “make run” on RedHat 9.0. After
parsing the data file, the tool collects ps, inet listening server, uname, passwd,
shadow, RPM, RPM version comparison, and file attribute information, and then prints
one “Vulnerability is found” line per detected CAN identifier.]

Fig. 4. Vulnerability report output when the vulnerabilities are detected in the system



4 Comparison with Previous Tools

We designed and implemented an OVAL-based vulnerability assessment tool operating on
the RedHat Linux platform. Other existing tools for UNIX-like platforms, such as
Tiger or SATAN, have their own specific scripts and specific goals. Our design
follows the standard guideline suggested by MITRE, so our tool is a general-purpose
assessment tool and shares the benefits of the OVAL concept, which are as follows:

• A simple and straightforward way to determine if a vulnerability exists on a
given system

• A standard, common schema of security-relevant configuration information

• For each CVE entry, one or more SQL queries precisely demonstrate that the
vulnerability exists

• Reduces the need for disclosure of exploit code as an assessment tool

• An open alternative to closed, proprietary, and replicated efforts

• A community effort of security experts, system administrators, software
developers, and other experts

• Freely available vulnerability content for public review, comment, or
download from the Internet

• Industry-endorsed via the OVAL Board and OVAL Community Forum



5 Conclusions 

OVAL is a common language with many benefits, capable of checking for the presence
of vulnerabilities on a computer system by specifying logical conditions on the
values of system characteristics and configuration attributes. The vulnerability
assessment tool suggested in this paper is based on the OVAL scheme, so it is more
efficient and flexible. We designed the overall structure with five modules, one
Data File, and the SQLite DBMS.

Existing assessment tools only check for the existence of vulnerabilities by going
through the checklists mainly listed in [3]. The suggested tool can not only check
the weak points but also define new checklists in XML and SQL syntax.

Traditional tools only check the main weak points that have been targeted by
attackers. The suggested tool can check all the weak points registered in the CVE
list at once.

In addition, because existing tools use a wide variety of description languages that
differ from one another, their detection results are not reliable. OVAL-based
vulnerability assessment methods are increasingly valued by security experts, so
tools for various OS platforms will continue to be developed in the future.
Abstract. The primary purpose of this study is to examine the performance of models
for estimating the average delay experienced by vehicles passing through a
signalized intersection network, and to improve the models’ performance for
Intelligent Transportation Systems (ITS) applications in terms of actuated signal
operation. Two major problems affecting the models’ performance were identified
through the empirical analyses in this paper. The first problem is related to the
time period of delay estimation. The second problem is associated with the fact
that the observed arrival flow patterns are quite different from those assumed in
developing the existing models. This paper presents several methods to overcome
these problems when estimating the delay with the existing models.



1. Introduction 

Many models have been developed for the purpose of delay estimation at signalized
intersection networks. It is known that the results of the existing models are very
sensitive to the degree of saturation as well as to the arrival flow pattern at the
intersections during the time period of interest. This implies that the models’
reliability is highly dependent on whether the input variables of the model
adequately describe the real traffic conditions [1, 2, 3, 4]. One main purpose of
this study is to evaluate five major models for the feasibility of delay estimation
for urban signalized intersection networks. The five models are the Webster, US
Highway Capacity Manual (HCM), Transyt-7F, Akcelik, and Hurdle models. Another main
purpose of this study is to improve the models’ performance for ITS applications
such as actuated signal operation. To accomplish these purposes, the input variables
of the five models were acquired from traffic data collected in the field. The
models’ results were compared with the results obtained from the conventional
queuing theory, the cumulative arrival and departure technique, using the field
data. Two study sites in Seoul with different traffic states were selected: one was
saturated and the other was non-saturated.
2. Related Work

The operation of each intersection approach can be modeled as shown in Figure 1. In
the figure, the y-axis is the cumulative vehicle count (N), and the x-axis is time
(t). The curve labeled A(t) shows the cumulative number of arrivals by time t, and
D(t) shows the cumulative number of departures. In fact, the A(t) curve does not
indicate the number of actual arrivals at the stop line, but the number that would
have arrived if the signal light had always remained green. The D(t) curve shows the
actual departures from the stop line. When the signal light is red, there are no
departures, so the D(t) curve is horizontal. The overall D(t) is the stair-step
curve outlining the triangles. In reality, it would begin to curve upward as
vehicles began to move after the start of green and then, after a few seconds,
become nearly straight with a slope equal to the saturation flow [5].




In Figure 1, the area below the dashed line is associated with the traffic situation
in which all the arrivals within one signal phase can pass through the intersection
during the same signal phase. This is called the non-overflow situation. In this
case, the A(t) curve will be the dashed line. The slope of the A(t) curve is the
arrival rate, and if this rate is constant over several signal cycles, the area
between the A(t) and D(t) curves is made up of a series of identical triangles. The
total delay per cycle can be estimated as the area of any single triangle. Dividing
this area by the number of arrivals per cycle yields the average delay, denoted UD,
which stands for average uniform delay, since it is derived under the assumption
that vehicles arrive at a uniform rate throughout the signal cycle. It should be
noted that, in making this assumption, we ignore both any random effects and any
pattern imposed on the arrival stream by upstream intersections. In the same figure,
the area above the dashed line is related to the traffic situation in which some of
the arrivals within one signal phase cannot get through the intersection during the
same signal phase. This is called the overflow situation or over-saturation. The
total overflow delay can be estimated as the area between the A(t) curve and the
dashed line. Dividing this area by the total number of arrivals during the time
period in which the arrival flows exceed the capacity yields the average overflow
delay, denoted OD. The average delay of each signalized intersection approach is
expressed as the sum of UD, OD, and a correction term, which typically has a
negative value. The correction term is generally obtained by simulation, but its
value is relatively small, so it is ignored for practical purposes.
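
The triangle argument for UD can be worked through numerically. In the sketch below
the symbols are assumptions introduced for illustration and do not appear in the
text: cycle length C, red time r, arrival rate v, and saturation flow s (both rates
in vehicles per second). The queue grows to v*r during red and clears at rate s - v
after the start of green, so one triangle has area v*r^2*s / (2(s - v)); dividing by
the v*C arrivals per cycle gives UD = r^2 / (2C(1 - v/s)).

```python
def uniform_delay(cycle, red, arrival_rate, saturation_flow):
    """Average uniform delay per vehicle (sec) in the non-overflow case."""
    if arrival_rate * cycle > saturation_flow * (cycle - red):
        raise ValueError("arrivals exceed capacity; not a non-overflow case")
    # Queue at the end of red, and the time needed to discharge it.
    max_queue = arrival_rate * red
    clear_time = max_queue / (saturation_flow - arrival_rate)
    # Area of one triangle between A(t) and D(t) = total delay per cycle.
    total_delay = 0.5 * max_queue * (red + clear_time)
    # Divide by the number of arrivals per cycle.
    return total_delay / (arrival_rate * cycle)


# Illustrative numbers only (not taken from the study sites).
ud = uniform_delay(cycle=140.0, red=80.0, arrival_rate=0.2, saturation_flow=0.5)
print(round(ud, 2))
```

The result agrees with the closed form r^2 / (2C(1 - v/s)) for the same inputs.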

Figure 2 illustrates the relationship between the five models’ performance and the
degree of saturation, v/c, where v = arrival flow and c = capacity of the
intersection approach. Although this figure is only an example, it provides useful
insight into the features of the five models’ performance. As the degree of
saturation approaches 1.0, the discrepancies among the models’ results increase
drastically. The discrepancies are serious in the range of v/c from 0.9 to 1.10.
From the figure we can see that the results of the existing models are very
sensitive to the degree of saturation as well as to the arrival flow pattern at the
intersection during the time period of interest. In the degree of saturation, c is a
manageable variable, but v is not, and thus v/c cannot be adjusted for the purpose
of reducing the discrepancies between the five models’ results.




Figure 2. Comparison of the five models’ results for the delay 

It should be noted that the US HCM has recently presented a new model that better
accounts for arrival flow variations by selecting an appropriate arrival flow
pattern among several predetermined patterns for the analysis [6]. However, this is
not the only way to improve the HCM model’s performance, so this study has selected
the model developed in 1994 in order to address another problem involved in the
model.



3. Evaluation of Existing Models’ Performance

In order to evaluate the five models’ performance, two study sites were selected.
Table 1 summarizes the traffic and signal conditions of the two study sites. Figure
3 shows the signal phases of the analysis intersection and the upstream intersection
of study site #1. The cycle length of the two intersections is 140 sec., and the
roadway has 4 lanes in each direction. The travel time between the two intersections
was 5 minutes during the data collection period. Traffic volume and speed were
collected at 15-minute intervals during the morning peak period between 7 a.m. and
9 a.m. Using the traffic data, the A(t) and D(t) curves were constructed as shown in
Figure 4. In order to align the two curves in time, the travel time between the
target and upstream intersections was estimated from the observed speed data.
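
The cumulative arrival and departure technique can be sketched as follows: the area
between the A(t) and D(t) curves is the total delay, and dividing it by the number
of arriving vehicles gives the average delay. The function name, the trapezoidal
integration, and the sample counts are assumptions for illustration; they are not
the field data or the exact procedure of this study.

```python
def field_delay(times, arrivals, departures):
    """Average delay per vehicle from cumulative counts A(t) and D(t).

    The area between the two curves (vehicle-seconds) is integrated with
    the trapezoidal rule and divided by the number of vehicles that
    arrived during the period.
    """
    area = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        gap_prev = arrivals[i - 1] - departures[i - 1]
        gap_curr = arrivals[i] - departures[i]
        area += 0.5 * (gap_prev + gap_curr) * dt
    vehicles = arrivals[-1] - arrivals[0]
    return area / vehicles


# Illustrative counts only (not the Seoul field data).
t = [0, 10, 20, 30, 40]
a = [0, 5, 10, 15, 20]   # cumulative arrivals, already shifted by the travel time
d = [0, 0, 0, 10, 20]    # cumulative departures (red until t = 20)
avg = field_delay(t, a, d)
print(avg)
```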



Table 1. Traffic and signal conditions of two study sites

Study   Traffic state          Traffic state            Distance   Number of       Cycle
site    (target intersection)  (upstream intersection)             signal phases   length
No. 1   Saturated              Saturated                600 m      4               Different
No. 2   Non-saturated          Saturated                500 m      4               Different

[Figure: signal phase diagrams with four phases per intersection. Phase durations as
printed: upstream intersection 65, 14, 46, 15; target intersection 58, 21, 46, 15.]

Figure 3. The signal phases of study site #1

In Figure 4, the specific traffic counts on the y-axis and times on the x-axis are
not presented, since these values are not important at this stage; the discussion in
this paper concerns the shapes of the two curves. The A(t) and D(t) curves are very
similar. This result is quite different from that of Figure 1. The reader may be
unsure which one reflects reality, but Figure 4 does. At study site #1, the target
and upstream intersections were saturated. Under this traffic situation, all the
vehicles passing the upstream intersection traveled at the same speed as the
vehicles passing the target intersection, and the arrivals at the target
intersection could not exceed the departures from the intersection. The reader
should remember that the A(t) curve does not indicate the number of actual arrivals
at the stop line, but the number that would have arrived if the signal light had
always remained green. However, in congested traffic conditions, the A(t) curve
would not differ much from the D(t) curve of the upstream intersection even if the
signal light had always remained green. As described in Section 1, the field delay
can be obtained from Figure 4. We now review the problem caused by the evaluation
time period in using the existing models. Figure 5 shows two settings of the
evaluation time period, T1 and T2. In practice, it is reasonable for the evaluation
time period to match the congested time period of the intersection of interest, but
the evaluation time period has typically been defined as 15 minutes or 1 hour. In
fact, the length of the evaluation time is not a big problem. The problem is the
setting of the time period. T1 and T2 have the same length, but their starting and
ending times are different. The main difference between the two periods is that T1
starts at the beginning of the first signal phase and T2 terminates at the end of
the final signal phase. Depending on how the evaluation time period is set, the
existing models’ results for the delay change significantly.






Figure 5. The relationship between delay evaluation period, signal phase, and v/c 

As mentioned before, the signal cycle lengths of the intersections at study site #1
are 140 seconds. If the 15-minute evaluation time period is set as T1, then T1 runs
from 0 to 900 seconds and terminates before the final signal phase has finished. To
set the evaluation time period as T2, we have to find the time that not only matches
the ending time of a signal phase but is also close to the 15-minute evaluation
period. This time is 980 seconds (140 seconds x 7 cycles), so T2 runs from 80 to 980
seconds. Table 2 summarizes the models’ results and the field delay obtained from
the cumulative arrival and departure technique by using T1 and T2. In the table, the
field delays obtained from the cumulative arrival and departure technique by using
T1 and T2 are very similar. However, the models’ results for overflow delay obtained
by using the two periods are quite different, while the results for uniform delay of
the two periods are identical with the exception of the HCM model. Although the
lengths of the two evaluation periods are equal, the overflow delays obtained by
using T1 are much greater than those of T2. The evaluation period T1 terminates
before the signal cycle has finished, so v/c is clearly overestimated. The results
of Table 2 confirm that the models’ results are very sensitive to v/c, which mainly
depends on the ending time of the evaluation time period. More specifically, the
degree of saturation, v/c, is mainly determined by whether the ending time of the
evaluation period coincides with the ending time of a signal cycle. Overall, the
models’ results for the delay are much greater than the field observations. In order
to overcome the problems of both T1 and T2, a new evaluation time period that
includes both T1 and T2 is proposed in this study. The new evaluation period starts
at the beginning of T1 and terminates at the end of T2, so the new evaluation time
period is longer than both T1 and T2.
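
The three evaluation periods can be reproduced with a few lines of arithmetic. The
sketch assumes that T2 ends at the first cycle boundary at or after the nominal
period length, which is one reading of the description above; the function name and
this rule are illustrative, not the authors' stated procedure.

```python
import math


def evaluation_periods(cycle, length):
    """Return (T1, T2, combined) as (start, end) tuples in seconds.

    Assumption: T2 ends at the first cycle boundary at or after `length`
    and has the same duration as T1.
    """
    t1 = (0, length)
    t2_end = math.ceil(length / cycle) * cycle
    t2 = (t2_end - length, t2_end)
    # The proposed period spans from the start of T1 to the end of T2.
    combined = (t1[0], t2[1])
    return t1, t2, combined


t1, t2, combined = evaluation_periods(cycle=140, length=900)
print(t1, t2, combined)  # (0, 900) (80, 980) (0, 980)
```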

Table 2. The comparison of the models’ results with the field observation of the study site #1

Evaluation period T1, from 0 to 900 sec. (v/c = 1.31, field delay = 109.09 sec)

Model      UD       OD        UD + OD
Webster    41.00    -         -
T-7F       41.00    142.61    183.61
Akcelik    41.00    142.65    183.65
HCM        39.97    189.59    229.56
Hurdle     41.00    140.17    181.17

Evaluation period T2, from 80 to 980 sec. (v/c = 1.20, field delay = 107.07 sec)

Model      UD       OD        UD + OD
Webster    41.00    -         -
T-7F       41.00    93.47     134.47
Akcelik    41.00    92.91     133.91
HCM        36.30    104.23    140.53
Hurdle     41.00    90.06     131.06


Table 3. The comparison of the models’ results with the field observation of the study site #1

Evaluation period from 0 to 980 sec. (v/c = 1.24, field delay = 107.55 sec)

Model      UD       OD        UD + OD
Webster    49.25    -         -
T-7F       49.25    109.56    158.81
Akcelik    49.25    109.26    158.51
HCM        37.44    129.27    167.11
Hurdle     41.00    106.57    147.57



Comparing the results in Tables 2 and 3, it is very clear that T1 overestimates the
overflow delay, since the overflow delays obtained by using T1 are still much
greater than those of the new period even though the new period is longer than T1.
In general, the models’ results fluctuate strongly with the change of evaluation
time period, while the field delays remain nearly constant.

The models’ overflow delays are persistently greater than the field observations.
The reason for this can be found in Figure 6. The A(t) obtained from the field
observation is a stair-step curve, while the A(t) of the existing models forms a
smooth curve. It is interesting that the discrepancy of the overflow delay between
the two curves is almost equal to the uniform delay, UD. Thus, if the uniform delay
is subtracted from the total delay of the models, the models’ results will match the
field values reasonably well.
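
As a quick arithmetic check of this adjustment, the sketch below subtracts UD from
UD + OD for three of the models, using the values reported in Table 3 for study site
#1, and compares the result with the field delay. The variable names are
illustrative, and the HCM row is left out of this illustration.

```python
# Values copied from Table 3 (study site #1, evaluation period 0 to 980 sec.).
FIELD_DELAY = 107.55
table3 = {
    # model: (UD, UD + OD)
    "T-7F": (49.25, 158.81),
    "Akcelik": (49.25, 158.51),
    "Hurdle": (41.00, 147.57),
}

# Total delay of each model minus its uniform delay.
adjusted = {
    model: round(total - ud, 2) for model, (ud, total) in table3.items()
}
# Difference between the adjusted model delay and the field delay (sec).
errors = {model: round(value - FIELD_DELAY, 2) for model, value in adjusted.items()}
print(adjusted)
print(errors)
```

For these three models the adjusted delays differ from the field delay by about two
seconds or less.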



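[Figure: A(t) curves plotted against time (sec); the region between the smooth model
curve and the stair-step field curve is labeled “Overestimated OD.”]

Figure 6. The discrepancy of A(t)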

Figure 7 shows the signal phases of study site #2. The intersections of study site
#2 are not saturated. The travel time between the analysis intersection and the
upstream intersection was 5.3 minutes during the data collection period. Traffic
volume and speed were collected at 15-minute intervals during the morning peak
period between 7 a.m. and 9 a.m. Using the traffic data, the A(t) and D(t) curves
were constructed as shown in Figure 8.



[Figure: signal phase diagrams with four phases per intersection. Phase durations as
printed: target intersection 27, 25, 47, 41; upstream intersection 58, 21, 46, 15.]

Figure 7. The signal phases of study site #2.

In Figure 8, the shaded area marked by a solid line is the observed delay and the
area represented by a dashed line is the delay estimated by the models. The two
areas form a diamond shape that is quite different from the triangle shown in Figure
1. The shaded area is larger than the area estimated by the models. The difference
between the two areas is gradually reduced over several signal cycles; as the signal
cycle repeats, the two areas converge to the same size. The signal cycle lengths of
the intersections at study site #2 are 140 seconds. The delay of study site #2 has
been estimated by the same procedures applied to study site #1. Tables 4 and 5
summarize the models’ results and the field delay obtained by using three different
evaluation time periods. As shown in Table 4, the evaluation period T1 terminates
before the signal cycle has finished, so v/c exceeds 1.0 although the site is not
saturated.




Figure 8. A(t) and D(t) curves of study site #2 

Correspondingly, the models produce an overflow delay. Using the period T2, the
models still produce an overflow delay even though v/c does not exceed 1.0, but the
delay is very small. However, the models’ results obtained with T2 are much lower
than the field observations.



Table 4. The comparison of the models’ results with the field observation of the study site #2

Evaluation period T1, from 0 to 900 sec. (v/c = 1.07, field delay = 58.99 sec)

Model      UD       OD       UD + OD
Webster    49.50    -        49.50
T-7F       49.50    38.10    87.60
Akcelik    49.50    34.95    84.45
HCM        38.77    33.94    72.71
Hurdle     49.50    32.07    81.57

Evaluation period T2, from 80 to 980 sec. (v/c = 0.95, field delay = 62.57 sec)

Model      UD       OD       UD + OD
Webster    49.50    -        49.50
T-7F       49.50    6.78     56.28
Akcelik    49.50    0.90     50.40
HCM        36.83    4.49     41.32
Hurdle     49.50    -        49.50


In Table 5, all the models with the exception of Transyt-7F tend to underestimate
the delay of the non-saturated intersection when the evaluation time period proposed
in this study is used. However, the Transyt-7F and Akcelik models produce reasonable
results that are very close to the field observation, so the two models seem
suitable for estimating the delay of non-saturated intersections.

Table 5. The comparison of the models’ results with the field observation of the study site #2

Evaluation period from 0 to 980 sec. (v/c = 1.01, field delay = 61.67 sec)

Model      UD       OD       UD + OD
Webster    49.71    -        49.71
T-7F       49.71    18.03    67.74
Akcelik    49.71    11.01    60.72
HCM        37.78    13.52    51.30
Hurdle     49.50    4.93     54.43


4. Conclusions

The primary purpose of this paper is to examine the performance of models for
estimating the average delay experienced by vehicles passing through an urban
signalized intersection network, and to present a method for improving the models’
performance. Two study sites in Seoul with different traffic states were selected:
one was saturated and the other was non-saturated. The empirical analyses
reconfirmed that the results of the existing models are very sensitive to the degree
of saturation as well as to the arrival flow pattern at the intersections during the
time period of interest. Depending on how the evaluation time period is set, the
existing models’ results for the delay change significantly. The field delays
obtained from the cumulative arrival and departure technique by using T1 and T2 are
very similar. However, the models’ results for overflow delay obtained by using the
two periods are quite different, while the results for uniform delay of the two
periods are identical with the exception of the HCM model. Although the lengths of
the two evaluation periods are equal, the overflow delays obtained by using T1 are
much greater than those of T2. In order to address the problem associated with the
setting of the evaluation time period, a new period that includes both T1 and T2 is
proposed in this study. The models’ performance is somewhat improved by using the
new period.
Abstract. Role-Based Access Control (RBAC) is a method to manage and control a host
in a distributed manner by applying rules to the users on the host. This paper
proposes a rule-based intrusion protection system based on RBAC. It admits or
rejects users’ access to network resources by applying the rules to the users on a
network. The proposed network intrusion protection system has been designed and
implemented with a menu-based interface, so it is very convenient for users.



1 Introduction 

The purpose of the RBAC policy is to protect computing and transmission resources
from being updated, exposed, or destroyed by unprivileged access [1]. The purpose of
the intrusion protection system is to protect network resources from access by
outside users; it is usually located on the local network server [2].

Basically, the firewall operates in close cooperation with the router program. It
inspects network packets, decides whether to accept them, and filters some of them
out. The firewall works interactively with the proxy server, whose role is to
resolve network requests on behalf of users, or it includes the role of the proxy
server itself [2]. Firewalls are generally used in corporations and public
organizations to filter access from specific users or hosts, but they are not used
in special-purpose networks such as proprietary PC rooms or Internet cafes.

This is because buying a firewall from a private security company is costly and
requires someone to operate it. On the other hand, the Linux-based freeware tools
such as ipfwadm, ipchains and iptables [3] are too difficult for non-experts to
configure with proper filtering policies.

An Intrusion Protection System (IPS) combines the characteristics of a firewall and
an Intrusion Detection System. This paper designs and implements an IPS based on the
RBAC method. It provides easy configuration interfaces to the operator.
Additionally, it lowers the cost of the security tool and makes it easy for the
operator to administer the users on hosts.

In Section 2, we review related work on the RBAC method and network-based Intrusion
Protection Systems, and address the efficiency of resource management when RBAC is
applied to the intrusion protection system. In Section 3, we design the RBAC-based
Intrusion Protection System, and in Section 4 we present the implementation
details. Section 5 concludes the paper.



2 Related Works 

2.1 RBAC Method 

The basic concept of RBAC is to protect the information resources of a company or
organization from unauthorized users. In RBAC, access authority is given to roles,
and each user is assigned to a proper role. A user assigned to a role can access
only the minimum subset of the total resources that is appropriate for that role [4].
This approach to authority management has some advantages: it simplifies the
administration of system security, and it gives a company the flexibility to
implement its own specific security policy.

Three DBs are necessary for the operation of RBAC, as shown in Figure 1. The
Operations-DB defines the usage and execution authorities of processes and
resources; these are used to decide whether a user or system daemon may use a
resource. The Roles-DB classifies the Operations-DB records according to roles.
The third DB, the Users-DB, defines the roles permitted to each user.

Process (1) in Figure 1 adds the combined and classified Operations-DB records to
the Roles-DB. Process (2) assigns an authority based on the Roles-DB to the system
users. This process makes the management of user authority very simple by
distributing the administrator's management responsibility across the users.
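
The relationship between the three DBs can be illustrated with a minimal in-memory
sketch. It is not the authors' implementation: the class name RbacStore, its method
names and the sample operations, role and user are all hypothetical, and plain
dictionaries stand in for the actual DBs.

```python
from dataclasses import dataclass, field


@dataclass
class RbacStore:
    """Toy stand-in for the three DBs of Figure 1 (all names are illustrative)."""
    operations: dict = field(default_factory=dict)  # Operations-DB: operation -> resource
    roles: dict = field(default_factory=dict)       # Roles-DB: role -> set of operations
    users: dict = field(default_factory=dict)       # Users-DB: user -> set of roles

    def add_operation(self, name, resource):
        self.operations[name] = resource

    def add_role(self, role, operation_names):
        # Process (1): group Operations-DB records under a role.
        unknown = set(operation_names) - set(self.operations)
        if unknown:
            raise KeyError(f"unknown operations: {sorted(unknown)}")
        self.roles[role] = set(operation_names)

    def assign(self, user, role):
        # Process (2): give a user the authority defined by a role.
        if role not in self.roles:
            raise KeyError(f"unknown role: {role}")
        self.users.setdefault(user, set()).add(role)

    def is_allowed(self, user, operation):
        # A user may perform an operation only if one of its roles contains it.
        return any(operation in self.roles[r] for r in self.users.get(user, ()))


store = RbacStore()
store.add_operation("read_mail", "pop3")
store.add_operation("send_mail", "smtp")
store.add_operation("browse_web", "http")
store.add_role("clerk", ["read_mail", "browse_web"])
store.assign("alice", "clerk")
```

With this sample data, alice may read mail and browse the web but may not send mail,
and a user with no role is denied everything.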




Fig. 1. The necessary DBs in RBAC operation

2.2 The Definition of IPS

The IPS is a network component that protects internal networks from attacks
initiated from outside networks. The IPS uses policies to protect internal
information from being accessed or modified by unauthorized attacks. It can be
implemented in software, hardware, or both. It is an active protection process that
blocks illegal incoming traffic and permits only authorized traffic [5]. The
purposes of the IPS are to block illegal external attacks, to prevent the loss,
destruction and alteration of internal information by untrusted users and hackers
on the Internet, and to help internal information be provided to the outside safely.

The IPS is generally located behind the router. It permits or denies the packets
forwarded to the router by analyzing them and comparing them with the filter rules.



3 The Design of RBAC-based IPS 



3.1 Operational Steps 

The operational steps of the RBAC-based IPS are shown in Figure 2. The first
operational step is to collect all the packets passing through the local network by
setting the Network Interface Card to promiscuous mode.







Fig.2. Operational steps in RBAC-based IPS 



Figure 2 shows the procedures of the RBAC-based Network Intrusion Protection System
when dummy hubs are used in the network. In procedure (1), host A transmits a
request packet to access an illegal site or program, or host A transmits a packet
in reply to a backdoor client program at a remote site. In either case, the packet
transmitted from host A is forwarded to the hub and the router on its way to the
external networks. Simultaneously, the transmitted packet is broadcast to the other
hosts in the internal network. When host B receives that packet, it silently
discards it because the packet is not destined for its own MAC address. Host X,
however, is running the Intrusion Protection System in promiscuous mode, so it
accepts all the packets passing through the local network. In procedure (2), the
RBAC-based Network Intrusion Protection System verifies the packets against the
predefined roles and sends an "ICMP Protocol Unreachable" message to host A when
the packets violate the roles. These roles are stored in the RBAC-based DBs in the
form of rules. When host A receives the ICMP message, it concludes that there is
some trouble or erroneous status at the host with which the connection was
requested. Therefore, host A notifies the remote host of that situation. Procedure
(3) shows that host A ignores the request packets from the erroneous host
afterwards.

In a switch-based network environment, the RBAC-based Intrusion Protection System
uses ARP spoofing to redirect to itself the packets that are going to the router.
The redirected packets are verified against the roles established in the IPS. If
the verification fails, the IPS sends an "ICMP Protocol Unreachable" message to
host A, and host A processes the ICMP message as a normal response. The difference
from the dummy hub environment is that host A does not receive any packet from the
outside network, because the packet transmitted from host A never actually left the
internal network. By applying these procedures, RBAC-based Network Intrusion
Protection Systems can restrict the usage of network resources according to the
predefined roles and assign specific privileges to the hosts in the inner network.



3.2 Logging and Scheduling

The logs are classified into three types: site-log, program-log and backdoor-log.
The site-log stores the list of illegal sites and domains. The program-log holds
information on non-permitted network programs. The backdoor-log is used as a
backdoor protection log. The log information filtered by RBAC in the filtering
module helps a network administrator find out the status of network resources.
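
The verification step of Section 3.1 and the three log categories can be sketched
together as follows. This is only an illustration under stated assumptions: the
rule layout, the role name, the port numbers, the sample addresses and the policy
of allowing hosts without a role are all hypothetical, and a "reject" verdict stands
for the case in which the IPS would send the "ICMP Protocol Unreachable" message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src_host: str
    dst_addr: str
    dst_port: int


# Hypothetical role rules; the paper keeps such rules in its RBAC-based DBs.
ROLE_RULES = {
    "guest": {
        "blocked_sites": {"203.0.113.10"},  # documentation-range address
        "blocked_ports": {110, 25},         # pop3 and smtp
        "backdoor_ports": {31337},          # assumed backdoor port
    },
}
HOST_ROLES = {"hostA": "guest"}


def verify(packet):
    """Return (verdict, log_name); log_name is None when the packet is allowed."""
    role = HOST_ROLES.get(packet.src_host)
    if role is None:
        # Assumption: hosts without an assigned role are not restricted.
        return "allow", None
    rules = ROLE_RULES[role]
    if packet.dst_port in rules["backdoor_ports"]:
        return "reject", "backdoor-log"
    if packet.dst_addr in rules["blocked_sites"]:
        return "reject", "site-log"
    if packet.dst_port in rules["blocked_ports"]:
        return "reject", "program-log"
    return "allow", None
```

Each rejected packet is attributed to exactly one log, so an administrator can tell
whether a violation concerned a site, a program or a backdoor.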



4 Implementation of RBAC-based IPS 

The RBAC-based Intrusion Protection System is implemented on the Linux OS. Figures
3 and 4 show the implemented command window of the RBAC-based IPS. An administrator
specifies the IP addresses of the managed resources and the range of subnets for
the managed ports in a file. The parsing module of the IPS then reads and parses
the contents of this predefined file, and the IPS filters all packets by monitoring
and capturing them according to the RBAC policies. Figure 3 is a window showing the
definition of the check items for packet filtering, and Figure 4 is one of the
active shots of the RBAC-based Intrusion Protection System.

Fig. 3. Managed Port Information Table of RBAC-based IPS



JIN - PineTerm v2.0.B

[root@jin rbac-ips]# ./rbac-ips
Usage : rbac-ips [-clapstne]
  -a : blocking, writing both
  -c : blocking only
  -l : writing only
  -X : blocking all tcp services except allowed ports
  -p : blocking pop3 (blocking mail receive)
  -s : blocking smtp (blocking mail send)
  -t : stopping all internet services
  -n : monitoring
  -e count : the number of blocking engine thread
  -i interface : Ethernet interface
  -w : Support switching environment
Thread 1 start...
Enable node for 43 minutes
Thread 2 start...
Enable node for 43 minutes
Thread 3 start...
Enable node for 43 minutes



Fig. 4. Menus and active shot of RBAC-based IPS



5 Conclusions 

This paper designed and implemented an Intrusion Protection System that manages
hosts according to predefined roles by applying the roles to the hosts in a network
rather than to individual users. The RBAC-based Intrusion Protection System has the
advantage that it can be deployed on one of the hosts as well as on the router in a
network. The implemented IPS also includes a function for collecting packets in a
switched network using ARP spoofing. Additionally, the proposed Intrusion
Protection System achieves effective filtering simply by applying the predefined
roles to the Hosts-DB, instead of requiring users to establish complex packet
filtering rules.

Its application areas are companies, proprietary PC rooms and Internet cafes that
require security settings for individual hosts to restrict the use of network
resources. Although we use ARP spoofing to collect packets in a switched network
environment, it can be a burden on the network, so further study is required to
lessen the load of ARP packets.
Abstract. In this paper we analyze the effect on the RF bandwidth when the
DARC data signal is added to the ordinary FM broadcasting signal. Generally,
the bandwidth of a commercial FM broadcasting signal is strictly restricted to
200 kHz. Hence, even when the DARC data signal is added to the ordinary
stereophonic FM signal, the required bandwidth should not exceed 200 kHz.
The simulation results show that even in the worst case the required bandwidth
is about 184 kHz, and the remaining 16 kHz of bandwidth could be used for other
FM data broadcasting services.



1. Introduction 

Since the FM multiplex broadcasting service called DARC was started in the
mid-1990s, there has been much interest in mobile data broadcasting services. By
virtue of the broadcasting characteristics of the FM audio service, the DARC system
makes it possible to broadcast information data to many customers distributed over
a wide area. The service area has also been widely extended from traffic, DGPS
(Differential Global Positioning System), weather and news to ITS (Intelligent
Transportation System) and telematics [1, 2].

The detailed specification of the DARC system is given by ITU-R [3], and
performance analyses of several constituent parts of the DARC system, such as
level-controlled MSK, immunity to multi-path fading environments, and error
correction ability, have been carried out in [4, 5].

The RF bandwidth of commercial FM broadcasting signals is usually set to 200 kHz.
Even though it is known that the DARC service is possible within the bandwidth
allotted to the FM broadcasting service, no previous work has treated the RF
bandwidth of the DARC system with a precise analysis. Such an analysis is useful
for the efficient usage of valuable frequency resources.

In this paper we analyze the effect on the RF bandwidth when the DARC data signal
is added to ordinary FM broadcasting signals. The analysis is performed by computer
simulation in which two systems are compared: the ordinary FM broadcasting system
and the DARC system. The level controller and band-pass filter considered in this
paper meet the requirements described in the DARC specification [3].
2. DARC System Model

We fully follow the specifications of the FM system and the DARC system [3] in
order to estimate precisely the RF bandwidth of the ordinary FM broadcasting system
and of the FM system including DARC data. Fig. 1 shows the block diagram of the
DARC system considered in this paper. The system consists of the stereophonic
matrix producing the L+R and L-R signals, a 19 kHz pilot generator, frequency
multipliers (x2 and x4), a data signal generator, and an FM generator; the
modulation level of the data signal is controlled according to the magnitude of the
L-R signal.




Fig. 1 The block diagram of the DARC system 

The inputs of the stereo signal generator are the L channel and R channel audio
signals. The matrix block generates the L+R signal, which is the sum of the two
audio signals, and the L-R signal, which is the difference between them. The L-R
signal is then frequency shifted by multiplying it by a 38 kHz single-tone carrier.
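
The matrix and frequency-shift steps can be sketched for a single sample as below.
This is a simplified illustration, not the simulation used in the paper: the
function name stereo_baseband, the 0.5 scaling of the sum and difference signals,
the pilot level of 0.1 and the use of cosine phases are assumptions, and the DARC
data signal at 76 kHz is left out.

```python
import math

PILOT_HZ = 19_000.0
SUBCARRIER_HZ = 2 * PILOT_HZ  # 38 kHz, the pilot frequency doubled


def stereo_baseband(left, right, t, pilot_level=0.1):
    """Composite stereo baseband sample at time t (seconds).

    left and right are the audio samples of the L and R channels at time t.
    """
    total = left + right  # L+R stays at audio frequencies
    diff = left - right   # L-R is shifted to around 38 kHz
    return (0.5 * total
            + 0.5 * diff * math.cos(2 * math.pi * SUBCARRIER_HZ * t)
            + pilot_level * math.cos(2 * math.pi * PILOT_HZ * t))
```

When the two channels are identical the L-R term vanishes, so only the L+R signal
and the pilot remain in the composite signal.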

The LMSK generator controls the magnitude of the DARC data signal according to the
magnitude of the L-R signal. Its output is MSK modulated and band-pass filtered.
The frequency of the sub-carrier for MSK modulation is 76 kHz. The specification of
the DARC system [3] gives upper and lower bounds on the frequency response of the
DARC data signal, which are shown in Fig. 2. To meet this frequency response
requirement, we adopted a Chebyshev type-2 filter of order 8 [7], whose
coefficients are listed in Table 1.

Fig. 2 Filter requirement of the DARC system and the Chebyshev type-2 filter



Table 1 The coefficients of the proposed Chebyshev type-2 filter

Numerator Coefficients    Denominator Coefficients
 0.0009641                    1
-0.012423622                -13.40536282
 0.077007238                 85.71009262
-0.305025502               -346.663852
 0.865038284                992.139065
-1.8644737                -2129.490727
 3.162690668               3544.666673
-4.310679576              -4666.626359
 4.773804224               4910.109636
-4.310679576              -4142.52789
 3.162690668               2793.189761
-1.8644737                -1489.580883
 0.865038284                616.0639585
-0.305025502               -191.086542
 0.0009641                    0.38564381
3. Simulation and Results

3.1 The Characteristics of the Base Band Signal

We explore how the frequency responses of the L+R and L-R signals depend on the
correlation between the L and R channel signals. The correlation coefficient cf of
the input signals is defined as follows.

\[ \mathit{cf} = \frac{\mathrm{cov}(L,R)}{\sqrt{\mathrm{cov}(L,L)\cdot\mathrm{cov}(R,R)}} \]

Here, cov is the covariance, and L and R are the L and R channel audio signals,
respectively. Because the DARC system adopts the LMSK modulation method, in which
the injection level is controlled by the magnitude of the L-R signal, the level is
adjusted according to the correlation coefficient of the L and R channel signals.
Generally, the bandwidth of the L and R channel signals for FM broadcasting should
not exceed 15 kHz, so two stereophonic signals with bandwidths of 15 kHz are
generated. Table 2 shows the relation between the correlation coefficient and the
average injection level for the following two input cases: first, the correlation
of the two stereophonic signals is 0.9, and second, the correlation is zero, so
that the two audio input signals are independent.

Fig. 3 shows, in terms of the base-band frequency spectrum, the relation between
the magnitude of the LMSK-modulated DARC signal and the correlation coefficient
between the L and R channel input signals. As the correlation coefficient becomes
lower, the magnitude of the frequency spectrum of the L-R signal increases, and
with it the injection level of the DARC signal.
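
The dependence of the L-R magnitude on the correlation coefficient follows from the
variance of a sum and a difference of two signals. The short derivation below is a
sketch under the assumption, not stated in the paper, that both channels have the
same variance; the symbols \sigma_L, \sigma_R and \sigma are introduced here only
for illustration, and c_f denotes the correlation coefficient cf.

```latex
\begin{align}
\operatorname{Var}(L-R) &= \sigma_L^2 + \sigma_R^2 - 2\,\mathrm{cov}(L,R)
                         = \sigma_L^2 + \sigma_R^2 - 2\,c_f\,\sigma_L\sigma_R, \\
\operatorname{Var}(L+R) &= \sigma_L^2 + \sigma_R^2 + 2\,c_f\,\sigma_L\sigma_R, \\
\sigma_L = \sigma_R = \sigma \;&\Rightarrow\;
\operatorname{Var}(L-R) = 2\sigma^2\,(1 - c_f).
\end{align}
```

Under this assumption the L-R power is 0.2 \sigma^2 for cf = 0.9 and 2 \sigma^2 for
cf = 0, which agrees with the lower average injection level that Table 2 reports
for cf = 0.9.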

Table 2 The correlation coefficient and the average injection level of the DARC signals

cf      Average Injection Level for DARC
0.9     6.67 %
0       8.91 %





(b) The cf of the L, R channel signals is zero 



Fig. 3 Base-band frequency spectrums of the ordinary FM signals and DARC signal 
3.2 FM Bandwidth in the Stereophonic Signals

The bandwidth of a signal can be defined as the bandwidth that contains 99% of the
total power of the signal's spectrum [8]. In this paper, four cases are considered;
the detailed simulation conditions are as follows.

Case 1: The correlation coefficient of the L and R channel signals is 0.9.

Case 2: The correlation coefficient of the L and R channel signals is zero.

Case 3: The L and R channel signals are independent of each other, but most of
their power is concentrated in the band between 10 kHz and 15 kHz.

Case 4: Both the L and R channel signals are 15 kHz single tones with random
phase.

Table 3 presents the estimated RF bandwidths for the above four input conditions.
The results show that case 4 requires the largest bandwidth, and that the required
bandwidth gets smaller as the correlation coefficient between the L and R signals
increases.
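
The 99% power criterion behind these bandwidth estimates can be sketched as
follows. The paper does not say how the excluded 1% is distributed, so this sketch
assumes the common convention of leaving 0.5% of the power below the lower edge and
0.5% above the upper edge; the function name and the sample spectrum are
hypothetical.

```python
def occupied_bandwidth(freqs_hz, power, fraction=0.99):
    """Width of the band holding `fraction` of the total power.

    freqs_hz must be sorted in ascending order, and power[i] is the power in the
    bin at freqs_hz[i]. Half of the excluded power may lie below the lower edge
    and half above the upper edge.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    total = sum(power)
    if total <= 0:
        raise ValueError("spectrum has no power")
    tail = (1.0 - fraction) / 2.0 * total  # power allowed outside, per side

    # Move the lower edge up while the discarded power stays within the tail.
    discarded, lo = 0.0, 0
    while discarded + power[lo] <= tail:
        discarded += power[lo]
        lo += 1

    # Move the upper edge down in the same way.
    discarded, hi = 0.0, len(power) - 1
    while discarded + power[hi] <= tail:
        discarded += power[hi]
        hi -= 1

    return freqs_hz[hi] - freqs_hz[lo]


# Hypothetical five-bin spectrum centred on the carrier (offsets in Hz).
freqs = [-100e3, -50e3, 0.0, 50e3, 100e3]
spectrum = [1, 200, 598, 200, 1]
bandwidth_hz = occupied_bandwidth(freqs, spectrum)
```

In the sample spectrum the two outer bins together hold 0.2% of the power, so they
fall outside the 99% band and the estimate is the 100 kHz span of the inner bins.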

Fig. 4 shows the frequency spectrum when the maximum frequency deviation Δf is set
to 75 kHz and the stereophonic FM input signals fall under one of the following
three conditions: case 1, 2, and 4. The graphs show that, among them, Fig. 4(c)
requires the largest bandwidth and Fig. 4(a) occupies the smallest.

As mentioned in the section above, because the L-R signal is frequency shifted by a
38 kHz carrier, the frequency-shifted L-R signal has more effect on the total
FM-modulated bandwidth than the L+R signal does. Hence the total FM-modulated
bandwidth gets larger as the magnitude of the L-R signal grows.

Table 3 The required frequency bandwidths of the ordinary FM system

Input condition        Bandwidth (kHz)
Case 1, cf = 0.9       84.6
Case 2, cf = 0         93.8
Case 3, 10-15 kHz      147
Case 4, 15 kHz         182



3.3 FM Bandwidth with the DARC Data Signal 

Fig. 5 shows the RF spectra when the input sources are the same as in Fig. 4, the
maximum frequency deviation Δf is 75 kHz, and the DARC data signal is added to the
stereophonic FM signal. From the graphs, we can see that the third condition,
Fig. 5(c), gives the maximum bandwidth of 184 kHz.
(a) The cf of the L, R channel signals is 0.9
(b) The cf of the L, R channel signals is zero
(c) Both of the L and R channel signals are 15 kHz single tones
[Figure panels; horizontal axis: Frequency (kHz)]

Fig. 4 Frequency spectrums of the ordinary FM systems



4. Conclusions

In this paper we analyzed the effect on the RF bandwidth when DARC data is added to
FM broadcasting signals. As mentioned above, the FM broadcasting system and the
mechanism for adding DARC data presented in this paper fully follow the ITU-R
specification. Our research examined how much the RF bandwidths of the ordinary FM
broadcasting system and the DARC system are affected by the statistics of the
stereophonic input signals for FM broadcasting. The results show that, in the worst
case, the RF bandwidths of the ordinary FM broadcasting system and the DARC system
are 182 kHz and 184 kHz, respectively. This means that in the ordinary FM
broadcasting system there is still enough frequency space for additional data
services. Moreover, the total RF bandwidth does not exceed 200 kHz, which is the
requirement of the FM broadcasting service, even when the DARC data signal is added
to the ordinary FM broadcasting signal. These can serve as useful baseline results
for the efficient usage of frequency resources, such as additional data
broadcasting services in the ordinary FM frequency band.









S.W. Lee, K.J. Han, and K.C. Whang 





(a) The cf of the L, R channel signals is 0.9
(b) The cf of the L, R channel signals is zero
(c) Both of the L and R channel signals are 15 kHz single tones

Fig. 5 Frequency spectrums when the DARC data signals are added to the ordinary FM signals



References 



[1] Chen Wei and Shen-Weng Hong, An FM Sub-carrier Data Broadcasting System On Bus
Transport Indication System, IEEE Transactions on Broadcasting, Vol. 47, No. 1, Mar. 2001

[2] Jonguk Park, Jeongho Joh, Hyungchul Lim, Pilho Park and Sangwoon Lee, “THE 
DEVELOPMENT OF DGPS SERVICE SYSTEM USING FM DARC IN KOREA”, 
Proceedings on the Federation of International Federation of Surveyors, May 2001 

[3] ITU-R BS Recommendation DOC. 1194, 1995.

[4] Marion J. de Ridder-de Groote and Ramjee Prasad, Jan. H.Bom, “Analysis of New Methods 
for Broadcasting Digital Data to Mobile Terminal over an FM-Channel”, IEEE Transactions 
on Broadcasting, Vol. 40. No. 1. Mar. 1994 

[5] K. Kuroda, M. Takada, and T. Isobe, “Transmission Scheme of High-capacity FM
Multiplex Broadcasting System”, IEEE Transactions on Broadcasting, Vol. 42, No. 3,
Sep. 1996







Effective FM Bandwidth Estimate Scheme 






[6] Papoulis, Probability, Random Variables, and Stochastic Processes. Third Edition, McGraw 
Hill, 1991. 

[7] Michel C. Jeruchim, Philip Balaban, and K. Sam Shanmugan, Simulation of 
Communication Systems, Plenum Press, 1992. 

[8] E B. Crutchfield, NAB Engineering Handbook 7th edition, NAB, 1985. 

[9] ITU-R BS Recommendation DOC. 643, 1986 

[10] Ferrel G. Stremler, Introduction to Communication Systems, Third Edition, Addison- 
Wesley, 1990. 

[11] J. R. Carson, “Notes on the Theory of Modulation,” Proceedings of the IEEE, Vol. 51 
(1951) :pp 893-896. 




Enhanced Algorithm of TCP Performance on Handover 
in Wireless Internet Networks 



Dong Chun Lee 1, Hong-Jin Kim 2, and Jae Young Koh 3

1 Dept. of Computer Science, Howon University, Korea
ldch@sunny.howon.ac.kr
2 Dept. of Computer Information, KyungWon College, Korea
3 Director, Principal Member of Eng. Staff, National Security Research Institute, Korea



Abstract. In TCP over Wireless Internet Networks (WINs), TCP responds to losses
caused by high bit error rates and handovers by invoking its congestion control
and avoidance algorithms. In this paper we propose a new handover notification
algorithm in which the mobile host sends an explicit handover notification
message to the source host when a handover occurs. Upon receipt of the explicit
handover notification, the source host enters persist mode, so that data
transmission at the source host is frozen during the handover. Numerical results
show that the proposed algorithm provides a small performance improvement over
general TCP, and greater improvements are expected when handovers are frequent
in WINs.



1 Introduction 

The existing Internet environment has changed into a single network that integrates
wired and wireless networks, owing to the appearance of wireless networks. The
first issue faced in this integrated Internet is terminal mobility. Accordingly,
Mobile IP, which handles terminal mobility by reinforcing addressing and routing,
has been studied very actively. Apart from this network-layer issue, there remain
issues such as validating the efficiency of TCP in guaranteeing the reliability of
end-to-end connections [2, 3]. TCP, which dominates today's communication
environment, is suited to traditional networks composed of wired links and fixed
hosts. Applying TCP to a wireless network, which differs from a wired network in
bandwidth, delay, discontinuity, bit errors, disconnection and handover, degrades
end-to-end throughput through unnecessary invocation of the congestion control
mechanism [4, 7].

TCP treats packet loss caused by bit errors in the network with its congestion
control mechanism, because it assumes that packet loss is due to congestion. This
mistreatment reduces throughput. When a TCP sender detects packet loss, it reduces
its transmission window size and resends the lost packet. In wireless networks,
however, packet loss is also caused by disconnection during handover and by a high
bit error rate, not only by congestion. Applying the loss recovery mechanism
described above to such a network causes unnecessary performance degradation. In
addition, applying TCP to a mobile host results in unnecessarily low bandwidth
usage and in performance degradation through reduced throughput and high delay.
To build an environment for mobile computing, we must locate a host that changes
its point of attachment and maintain its connection across those changes during
communication. Revised schemes have been proposed and studied that do not affect
existing TCP yet take the characteristics of wireless networks into account.

There are several approaches, from different perspectives, to improving end-to-end
throughput in wireless networks: end-to-end protocols, split-connection protocols
and link-layer protocols. In an end-to-end protocol the sender is aware of the
existence of the wireless network. It mainly uses Selective Acknowledgment, which
can recover several packets in a window without depending on a timeout for
retransmission, and ELN, which prevents invocation of the congestion control
mechanism by informing the sender that the reason for a packet loss is not
congestion but the characteristics of the network. As the name suggests, a
split-connection protocol splits the connection into its wired and wireless parts
and applies a suitable protocol to each. In other words, the connection is split
into the wired network between the fixed host (FH) and the Base Station (BS) and
the wireless network between the mobile host and the BS. A link-layer protocol
provides local reliability; it uses a combination of ARQ (Automatic Repeat reQuest)
and FEC (Forward Error Correction), which improves TCP performance by hiding
wireless-link losses from higher layers such as TCP.

This paper proposes a way to improve TCP performance during handover in WINs.
When a handover happens, the Mobile Host (MH), which recognizes the handover, sends
an explicit handover notification packet formatted as an ICMP message to the fixed
host to inform it of the handover. This prevents retransmission timeouts and forced
invocation of the congestion control process.



2 Related Work 

The fast retransmission algorithm shortens the time taken by a handover by making
the fixed host retransmit quickly: after the handover completes, the MH sends three
duplicate ACKs for the last received packet to the Fixed Host (FH) [5]. The
existing approach to improving end-to-end efficiency is for the transport protocol
to resume sending data as soon as the handover completes, so that the
retransmission timer of the fixed host does not time out. The advantage of the fast
retransmission algorithm is that it requires the least change to the software at
the end hosts. Mobile IP is revised to signal a completed handover to the layer
above it in the protocol stack, and TCP is revised to invoke the fast
retransmission procedure. It does not depend on intermediate routers in the
wireless network [7]. After the first retransmission, the sender goes through the
congestion control procedure, closing its window and using the slow start
algorithm. Therefore both the rest of the network and the mobile host can avoid
congestion in the cell.

Probe [3] makes the FH quickly retransmit the relevant packets by having the MH
send three Negative ACKs to the FH. The Probe algorithm is a split-connection
protocol that hides MH movement from TCP to decrease the impact of handover on TCP
operation. Like the fast retransmission algorithm, Probe uses a signal from the
network layer that notifies it of the handover. This algorithm tries to solve
disconnection during handover by using information from Mobile IP to detect
completion of the handover.

The Snoop protocol is a split-connection approach that provides a new routing
protocol to decrease data loss during handover. When a handover happens, unlike
other protocols, it can skip the step of forwarding data from the old BS to the new
BS. As a result, it can remove the data forwarding delay of the handover.

Among the previous handover algorithms, the Fast Retransmission algorithm [4] is an
end-to-end protocol in which the fixed host, the sender node, is aware of the
wireless link. Based on measurements of the impact of mobile movement on throughput
and delay and on identification of the causes of performance loss, it resumes
sending quickly to shorten the pause before communication restarts at the network
level after the handover completes. In other words, it reduces the time lost to a
handover after the handover procedure has completed. After the handover, the MH
sends three duplicate ACKs for the last received packet to the FH, and the FH
starts retransmitting the lost segments without waiting for its retransmission
timer to expire. Because fast retransmission is invoked only when a handover is
signaled, it requires only minimal changes to the software of the existing FH. In
addition, it does not depend on the underlying network or on intermediate routers.
However, it only provides a solution for packet loss due to disconnection during
handover and disregards packet loss caused by the high bit error rate of the
network itself. Besides, after the handover completes it takes several round trip
times to reach the full bandwidth, and the sending rate decreases because the
sending window is shrunk. The Probe algorithm is a split-connection protocol that
hides the movement of the MH from TCP to decrease the impact of handover on TCP
operation. It triggers fast retransmission of the corresponding packets by having
the MH send three Negative ACKs to the FH after the handover completes. This
algorithm has the problems that ACK packets must always be stored and sent, and
that layer transparency cannot be guaranteed at the BS. The Snoop protocol [2] is a
split-connection approach that provides a new routing protocol to decrease data
loss during handover. This algorithm can remove the handover delay and improve
efficiency when the bit error rate is high. However, it has drawbacks such as the
large amount of information involved in the handover process and the heavy burden
of unnecessary data and buffering.

The above algorithms for recovering from the TCP efficiency loss caused by handover
try to improve end-to-end performance by minimizing TCP timeouts at the FH and by
decreasing the delay after the handover completes. However, it still takes several
round trip times to reach the full bandwidth, and data lost to a timeout while the
data buffered in the BS is being sent must still be retransmitted.



3 New Handover Algorithms 

When a handover occurs in a cell, the MH sends an explicit handover notification
packet, structured as an Internet Control Message Protocol (ICMP) message, to the
FH and the old BS to indicate explicitly that the handover has started. As a
result, neither a timeout of the FH retransmission timer nor the congestion control
procedure takes place. The explicit handover notification packet includes the
window size, which is the buffer size of the receiving host, and the address of the
new BS to which the MH attaches after the handover. This puts the sender FH into
persist mode, in which the sending host basically cannot transmit any packets. The
packet exchange procedure during handover is shown in Figure 1.



MH New BS Old BS FH 




Fig. 1. Message changing procedure during the handover 



In the handover process, the MH receives a new beacon and thereby learns that the
handover has started; the new BS then sends a Greet packet including the IP address
of the FH, that of the old BS, and its own IP address. The explicit handover
notification packet, carrying the window information and the address of the MH's
new BS, is sent to the FH. The new BS sends a Greet Ack packet to the MH, and the
old BS transmits its buffered data to the MH through the new BS. When the handover
is over, the MH sends the greatest sequence number that it received from the old
BS, and packet data from the FH can be received again. When the FH receives the
explicit handover notification packet, it sets its sending window size to 0,
changes itself to persist mode, and waits for the Ack message from the MH
indicating that the handover is over. In persist mode, the FH halts its
retransmission timer and freezes its state, and periodically sends Probe packets
that ask the receiver node whether the window for sending packets has increased;
these Probe packets are processed by the new BS. When the Ack message from the MH
is sent via the new BS, the FH changes from persist mode to normal mode and
restores the window value advertised before the handover. The FH then sends packets
using the sending window size that was in use before the handover. The explicit
handover notification packet is sent only once per handover, so it has little
effect on the traffic of the path. The algorithm is as follows.
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Algorithm Handover 
Procedure MH 
while 

If Handover happen to Then 
make explieit handover notification packet; 
explicit handover notification packet; 
mobile host IP, fixed host IP, New BS IP, 
mobile host's advertising window = 0; 
send explicit handover notification packet; 

If handover is over Then 
make Ack packet; 

include information of sequence number from old BS; 
send Ack packet; 

Procedure New BS 
while 

If old BS's buffering packet receive Then 
send buffering packet; 

If Probe packet receive Then 
Persist reply mode; 

Procedure FH 
while 

If explicit handover notification packet receive Then 
enter Persist mode; 

If Ack packet receive from MH Then 
come out form the Persist mode; 
end FH; 
end New BS; 
end 
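
The FH side of the algorithm above can be sketched as a small state machine. This is only
an illustrative sketch, not the authors' implementation: the class names, field names, and
IP addresses are hypothetical, and resuming from the sequence number after the one reported
in the Ack is an assumption that the text does not state.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Mode(Enum):
    NORMAL = auto()
    PERSIST = auto()


@dataclass
class HandoverNotification:
    """Explicit handover notification sent by the MH (field names are assumptions)."""
    mh_ip: str
    fh_ip: str
    new_bs_ip: str
    advertised_window: int = 0  # a zero window freezes the sender


@dataclass
class HandoverAck:
    """Ack sent by the MH once the handover is over."""
    last_seq_from_old_bs: int  # greatest sequence number received via the old BS


class FixedHost:
    """Sender-side behaviour: freeze on notification, resume on Ack."""

    def __init__(self, send_window: int) -> None:
        self.mode = Mode.NORMAL
        self.send_window = send_window
        self.saved_window = send_window
        self.retransmit_timer_running = True
        self.next_seq = 0

    def on_handover_notification(self, note: HandoverNotification) -> None:
        # Remember the pre-handover window, then stop sending and stop the timer.
        self.saved_window = self.send_window
        self.send_window = note.advertised_window
        self.retransmit_timer_running = False
        self.mode = Mode.PERSIST

    def on_handover_ack(self, ack: HandoverAck) -> None:
        if self.mode is not Mode.PERSIST:
            return  # ignore a stray Ack outside a handover
        # Restore the window used before the handover and restart the timer.
        self.send_window = self.saved_window
        self.retransmit_timer_running = True
        # Assumption: resume right after the last packet the MH reports.
        self.next_seq = ack.last_seq_from_old_bs + 1
        self.mode = Mode.NORMAL

    def sendable_packets(self) -> int:
        # In persist mode only Probe packets would be sent, so no data packets.
        return self.send_window if self.mode is Mode.NORMAL else 0


fh = FixedHost(send_window=32)
fh.on_handover_notification(
    HandoverNotification(mh_ip="10.0.0.2", fh_ip="10.0.0.1", new_bs_ip="10.0.1.1")
)
frozen = fh.sendable_packets()
fh.on_handover_ack(HandoverAck(last_seq_from_old_bs=41))
resumed = fh.sendable_packets()
```

Because the window is saved and restored rather than recomputed, the sketch does not go
through slow start after the handover, which is the behaviour the text describes for the
return to normal mode.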



4 Simulation and Analysis 

The simulation was performed using NetSimulator on IBM-compatible PCs with Pentium
processors. The simulation is based on a network model in which a switching system is
linked to the nodes. Fig. 2 shows that this network model consists of an FH, an MH, and
BSs in a mobile environment. In this model, we assume that the wired link runs at 56 Kbps,
the wireless link runs at 19.2 Kbps, the maximum window size is 32, the packet size is
128 bytes, and TCP in the FH uses the Tahoe TCP protocol. In order to ensure proper packet
processing, the information used for the actual simulations, which was computed using the
Poisson distribution, was applied equally to the mobile networks. The simulation performed
seven handovers within 100 seconds for the performance comparison.
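
To give a feel for the link parameters above, the following sketch computes how long one
packet occupies each link. It rests on two assumptions that the text does not state: that
1 Kbps means 1000 bits per second, and that the maximum window size of 32 is counted in
packets. The function and constant names are illustrative only.

```python
WIRED_BPS = 56_000       # wired link, assuming 1 Kbps = 1000 bit/s
WIRELESS_BPS = 19_200    # wireless link
PACKET_BYTES = 128
MAX_WINDOW_PACKETS = 32  # assumption: window counted in packets


def serialization_delay(packet_bytes: int, link_bps: int) -> float:
    """Seconds needed to put one packet on the link."""
    return packet_bytes * 8 / link_bps


def window_drain_time(window_packets: int, packet_bytes: int, link_bps: int) -> float:
    """Seconds needed to send a full window back to back, ignoring headers and ACKs."""
    return window_packets * serialization_delay(packet_bytes, link_bps)


wired_ms = serialization_delay(PACKET_BYTES, WIRED_BPS) * 1000
wireless_ms = serialization_delay(PACKET_BYTES, WIRELESS_BPS) * 1000
full_window_s = window_drain_time(MAX_WINDOW_PACKETS, PACKET_BYTES, WIRELESS_BPS)

print(f"wired: {wired_ms:.2f} ms, wireless: {wireless_ms:.2f} ms")
print(f"full window over wireless: {full_window_s:.2f} s")
```

Under these assumptions a full window takes well over a second to cross the wireless link,
which suggests why losing the window to a timeout during a handover is costly in this
setting.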

Fig. 3 shows the throughput while the MH is in handover. The proposed algorithm loses
fewer packets than general TCP because the buffered packets are transmitted to the MH
through the old BS during the handover. For the proposed method, the longer the handover
takes, the more the overall throughput decreases, and the longer the staying time in a
cell, the more the throughput decreases. Also, the longer the holding time in a cell and
the handover time, the larger the performance difference.




[Figure: network model diagram; surviving label: Buffering System]

Fig. 2. Network model

Fig. 3. Comparison of handover time










D.C. Lee, H.-J. Kim, and J.Y. Koh 



5 Conclusions 

In this paper we propose a new handover algorithm that reduces the TCP performance
degradation during handover. When a handover starts, upon receipt of an explicit handover
notification, the source host enters persist mode. In this way, data transmissions at the
source host are frozen during the handover. The numerical results show that the proposed
algorithm provides a slight TCP performance improvement over the previous algorithm, and
greater improvements are expected as the number of handovers increases.